
gff to ffn

How to extract gene sequences from gff files?

Using PanPhlAn


# Run PanPhlAn to add the .fna genome sequences to the .gff files (results are written to the folder gff_added_fna/)
python panphlan_pangenome_generation.py --i_fna ncbi_downloaded_fna_files/ --i_gff ncbi_downloaded_gff_files/

# Run PanPhlAn again to extract the gene sequences from the .gff files that now include the genome sequence
python panphlan_pangenome_generation.py --i_gff gff_added_fna/

# Result: the .ffn files containing the gene sequences are written to the folder ffn_from_gff/
# Each geneID in the .ffn files contains the contigID, the start and stop positions (1-based, as in the gff) and the functional description (product info from the gff):
    >Filename:contigID:start-stop gff-locus-tag-ID gff-gene-product
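
The header layout above can be split back into its parts, for example to list gene coordinates and lengths. This is a minimal sketch, assuming that the position field is the last ":"-separated part of the geneID and that start <= stop. The file names and the demo record are hypothetical and were not produced by PanPhlAn.

```shell
# hypothetical demo record following the header layout shown above
printf '>genomeA:contig1:1-9 locus_0001 hypothetical protein\nATGAAACCC\n' > demo_panphlan.ffn

# for each header line: print geneID, start, stop, length (1-based, inclusive) and locus tag
awk '/^>/ {
    id = substr($1, 2)              # first word without the leading ">"
    n = split(id, part, ":")        # the last ":"-separated part is start-stop
    split(part[n], pos, "-")
    print id "\t" pos[1] "\t" pos[2] "\t" (pos[2] - pos[1] + 1) "\t" $2
}' demo_panphlan.ffn > demo_panphlan.tsv
```

Because the positions are 1-based and inclusive, the length is computed as stop - start + 1.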

Using BED-tools

Requires a gene annotation .gff file and a separate genome .fna file.

# 1) extract the gene locations from the .gff file (saved as a bed file)
#     considering the feature types (gff column 3) "CDS", "gene", "tRNA", ...
awk '($3=="CDS" || $3=="gene" || $3=="tRNA" || $3=="tmRNA" || $3=="ncRNA" || $3=="rRNA") {OFS="\t"; print $1,$4-1,$5}' Ecoli.gff > Ecoli.bed

# 2) use bedtools to extract the gene sequences from the genome sequence .fna file
#     result: the gene sequences are saved as a multi-FASTA .ffn file
bedtools getfasta -fi Ecoli.fna -bed Ecoli.bed -fo Ecoli.ffn
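
Because the awk filter in step 1 keeps both "gene" and "CDS" features, a gene and its CDS with identical coordinates (common in prokaryotic annotations) end up as two identical bed lines, and therefore as two identical sequences in the .ffn file. A minimal sketch of removing such duplicates before running bedtools, shown on a hypothetical demo file (demo_dup.bed and demo_uniq.bed are illustrative names):

```shell
# hypothetical bed file in which the first interval appears twice (gene + CDS)
printf 'contig1\t0\t9\ncontig1\t0\t9\ncontig1\t9\t18\n' > demo_dup.bed

# keep only the first occurrence of each contig/start/end combination, preserving the order
awk '!seen[$1,$2,$3]++' demo_dup.bed > demo_uniq.bed
```

The same filter could be applied to Ecoli.bed before step 2.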

Note1: bedtools uses 0-based start positions (the 1-based gff start is converted by $4-1; the end position is kept as is). The location string in the geneIDs of the .ffn file is therefore also 0-based.
Note2: This awk & bedtools approach does not return the reverse complement for genes located on the reverse strand, because the bed file written in step 1 has no strand column. All gene sequences are given in forward-strand orientation.
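
Regarding Note2: bedtools getfasta has an -s option that reverse-complements features on the minus strand, provided the bed file carries the strand in column 6. A minimal sketch on hypothetical demo files (the demo_strand.* names and their contents are illustrative only, and the name built for column 4 is an arbitrary choice):

```shell
# tiny hypothetical inputs: one contig with a forward-strand and a reverse-strand CDS
printf '>contig1\nATGAAACCCGGGTTTTAG\n' > demo_strand.fna
printf 'contig1\tdemo\tCDS\t1\t9\t.\t+\t0\tID=cds1\n'    > demo_strand.gff
printf 'contig1\tdemo\tCDS\t10\t18\t.\t-\t0\tID=cds2\n' >> demo_strand.gff

# write a 6-column bed: contig, 0-based start, end, name, score, strand (gff column 7)
awk 'BEGIN {FS = OFS = "\t"} $3 == "CDS" {print $1, $4 - 1, $5, $1 ":" $4 "-" $5, 0, $7}' \
    demo_strand.gff > demo_strand.bed

# -s reverse-complements the features located on the minus strand
if command -v bedtools >/dev/null 2>&1; then
    bedtools getfasta -s -fi demo_strand.fna -bed demo_strand.bed -fo demo_strand.ffn
fi
```

Without -s, bedtools ignores the strand column and returns the forward-strand sequence, as in step 2 above.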





