Using PanPhlAn

# run panphlan to add .fna genome sequences to .gff files (result in folder gff_added_fna/)
python panphlan_pangenome_generation.py --i_fna ncbi_downloaded_fna_files/ --i_gff ncbi_downloaded_gff_files/

# run panphlan to extract gene sequences from gff files that have the genome sequence included
python panphlan_pangenome_generation.py --i_gff gff_added_fna/

# Result: ffn files containing the gene sequences are in folder: ffn_from_gff/
# geneID in ffn files contains contigID and start stop positions (1-based as in gff) and functional description (product info from gff)
>Filename:contigID:start-stop gff-locus-tag-ID gff-gene-product
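The geneID layout described above can be split back into its parts with a few lines of awk. This is only a minimal sketch: the header below is a hypothetical example written to match the pattern, and the code assumes that neither the filename nor the contigID contains a colon.

```shell
# Hypothetical header following the pattern >Filename:contigID:start-stop locus-tag product
header='>Ecoli_K12:NC_000913.3:190-255 b0001 thr operon leader peptide'

# Split the first token on ":" and the position part on "-"
# (assumption: exactly three colon-separated fields in the first token)
parsed=$(printf '%s\n' "$header" | awk '{
    sub(/^>/, "", $1)            # drop the leading ">"
    split($1, id, ":")           # id[1]=filename, id[2]=contigID, id[3]=start-stop
    split(id[3], pos, "-")       # pos[1]=start, pos[2]=stop (1-based, inclusive)
    print id[1], id[2], pos[1], pos[2], pos[2] - pos[1] + 1
}')

# Prints: filename contigID start stop gene-length
echo "$parsed"
```

Because the positions are 1-based and inclusive, the gene length is stop - start + 1.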
Using BED-tools

Requires a gene annotation .gff file and a separate genome .fna file.

# 1) using awk to convert the gene coordinates from the .gff file into a .bed file
# considering the feature types "CDS", "gene", "tRNA", ...
awk '($3=="CDS" || $3=="gene" || $3=="tRNA" || $3=="tmRNA" || $3=="ncRNA" || $3=="rRNA") {OFS="\t"; print $1,$4-1,$5}' Ecoli.gff > Ecoli.bed

# 2) using bedtools to extract the gene sequences from the genome sequence .fna file
# as result: gene sequences are saved as multifasta .ffn file
bedtools getfasta -fi Ecoli.fna -bed Ecoli.bed -fo Ecoli.ffn

Note1: bedtools uses 0-based start positions (the 1-based gff start positions are corrected by $4-1). The location string in the geneIDs (.ffn file) is also 0-based.
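To make the coordinate shift concrete, the sketch below runs the same kind of awk conversion on a two-line annotation. The demo file and its contents are hypothetical and exist only for illustration.

```shell
tmp=$(mktemp -d)

# Hypothetical tab-separated GFF lines: a CDS at 1..9 and a tRNA at 11..16 (1-based, inclusive)
printf 'contig1\tsrc\tCDS\t1\t9\t.\t+\t0\tID=geneA\n'    > "$tmp/demo.gff"
printf 'contig1\tsrc\ttRNA\t11\t16\t.\t-\t.\tID=geneB\n' >> "$tmp/demo.gff"

# GFF start (1-based) becomes BED start (0-based); the end position stays unchanged
awk -F'\t' '($3=="CDS" || $3=="tRNA") {OFS="\t"; print $1,$4-1,$5}' "$tmp/demo.gff" > "$tmp/demo.bed"

cat "$tmp/demo.bed"
```

In the resulting .bed file the intervals are 0..9 and 10..16, so the feature length is simply end minus start (9 and 6 bases).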
Note2: This 'awk' & bedtools approach does not provide the reverse complement when the gene is located on the reverse strand. All gene sequences are forward strand oriented.
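One possible way around this limitation, not part of the recipe above, is to carry the strand into the .bed file and use the -s option of bedtools getfasta, which reverse-complements features on the "-" strand. The sketch below uses hypothetical demo files, takes the gff attribute column as the feature name, and only calls bedtools if it is installed.

```shell
tmp=$(mktemp -d)

# Hypothetical demo inputs: one 20-bp contig and two CDS features, one on each strand
printf '>contig1\nATGAAACCCGGGTTTTAGCA\n' > "$tmp/demo.fna"
printf 'contig1\tsrc\tCDS\t1\t9\t.\t+\t0\tID=geneA\n'   > "$tmp/demo.gff"
printf 'contig1\tsrc\tCDS\t11\t16\t.\t-\t0\tID=geneB\n' >> "$tmp/demo.gff"

# BED6 layout: chrom, start (0-based), end, name, score, strand (taken from gff column 7)
awk -F'\t' '($3=="CDS") {OFS="\t"; print $1,$4-1,$5,$9,".",$7}' "$tmp/demo.gff" > "$tmp/demo.bed6"

# -s forces strandedness: "-" strand features are written as reverse complement
if command -v bedtools >/dev/null 2>&1; then
    bedtools getfasta -s -fi "$tmp/demo.fna" -bed "$tmp/demo.bed6" -fo "$tmp/demo.ffn"
    cat "$tmp/demo.ffn"
fi
```

The strand has to sit in the sixth column of the .bed file, which is why the three-column file from the recipe above is not sufficient for -s.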
Using cufflinks gffread

see also