gff to ffn

How to extract gene sequences from gff files?

# → download the genome (.fna) and gene annotation (.gff) files from NCBI

# run panphlan to add the .fna genome sequences to the .gff files (results are written to the folder gff_added_fna/)

python --i_fna ncbi_downloaded_fna_files/ --i_gff ncbi_downloaded_gff_files/

# run panphlan to extract gene sequences from gff files that have the genome sequence included

python --i_gff gff_added_fna/

# Result: ffn files containing the gene sequences are in folder: ffn_from_gff/

# The geneID in the ffn files contains the contigID, the start and stop positions (1-based, as in the gff), and a functional description (product info from the gff):

>Filename:contigID:start-stop gff-locus-tag-ID gff-gene-product
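
As an illustration of the header layout above, the following minimal sketch splits such a header back into its parts. The function name `parse_ffn_header`, the `GeneHeader` type and the example header line are made up for this sketch and are not part of panphlan. It assumes that the locus tag contains no whitespace and that the start-stop field contains no extra ':' or '-'.

```python
from typing import NamedTuple


class GeneHeader(NamedTuple):
    filename: str
    contig: str
    start: int      # 1-based, as in the gff
    stop: int       # 1-based, inclusive
    locus_tag: str
    product: str


def parse_ffn_header(line: str) -> GeneHeader:
    """Split '>Filename:contigID:start-stop locus-tag product' into fields."""
    # The first two whitespace-separated tokens are the location ID and the
    # locus tag; everything after them is the free-text product description.
    parts = line.strip().lstrip(">").split(maxsplit=2)
    parts += [""] * (3 - len(parts))  # tolerate a missing tag or product
    location, locus_tag, product = parts
    # Split from the right so that a ':' inside the filename does no harm.
    filename, contig, span = location.rsplit(":", 2)
    start, stop = span.split("-")
    return GeneHeader(filename, contig, int(start), int(stop), locus_tag, product)


# Hypothetical example header, for illustration only
header = parse_ffn_header(">Ecoli:NC_000913.3:190-255 b0001 thr operon leader peptide")
print(header.contig, header.start, header.stop, header.locus_tag)
```

Because the positions are 1-based and inclusive, the gene length is stop - start + 1.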

Alternative using awk and bedtools: requires a gene annotation .gff file and a separate genome .fna file

# 1) extract gene locations from .gff file (saved as bed-file)

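# considering the feature types (gff column 3) "CDS", "gene", "tRNA", ...
# (NCBI gff files usually list both a "gene" and a "CDS" feature for the same location, so selecting both types yields duplicate sequences)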

awk '($3=="CDS" || $3=="gene" || $3=="tRNA" || $3=="tmRNA" || $3=="ncRNA" || $3=="rRNA") {OFS="\t"; print $1,$4-1,$5}' Ecoli.gff > Ecoli.bed

# 2) use bedtools to extract the gene sequences from the genome sequence .fna file

# result: the gene sequences are saved as a multi-FASTA .ffn file

bedtools getfasta -fi Ecoli.fna -bed Ecoli.bed -fo Ecoli.ffn

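Note1: bedtools uses bed coordinates, which are 0-based with an exclusive end position. The 1-based gff start is therefore corrected by $4-1, while the gff end ($5) is kept unchanged. The start position in the location string of the geneIDs (.ffn file) is also 0-based.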
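
The off-by-one correction in Note1 can be checked on a toy sequence. Python slices use the same 0-based, end-exclusive convention as bed files, so the sketch below mirrors what bedtools does with the converted interval. The helper name `gff_to_bed` and the toy contig are made up for this example.

```python
from typing import Tuple


def gff_to_bed(start: int, end: int) -> Tuple[int, int]:
    """Convert a 1-based, inclusive gff interval to a 0-based, end-exclusive bed interval."""
    # Only the start shifts by one; the end keeps its value because an
    # inclusive 1-based end equals an exclusive 0-based end.
    return start - 1, end


genome = "AAACCCGGGTTT"                 # made-up toy contig
bed_start, bed_end = gff_to_bed(4, 6)   # gff positions 4..6 cover "CCC"
gene = genome[bed_start:bed_end]
print(bed_start, bed_end, gene)         # 3 6 CCC
```

The interval length is end - start in bed coordinates, compared with end - start + 1 in gff coordinates.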

Note2: This 'awk' & bedtools approach does not return the reverse complement for genes located on the reverse strand, because the bed file written in step 1 has no strand column. All gene sequences are given in forward-strand orientation.
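
If strand-correct sequences are needed, one option is to read the strand from gff column 7 and reverse-complement the features on the '-' strand. The following is a minimal Python sketch of that idea, not the panphlan implementation. The function names, the header layout and the toy data are made up for illustration. It assumes that the genome is already loaded into a dict of contigID to sequence and that only the bases A, C, G, T and N need complementing. (bedtools getfasta also has a -s option for this, which needs a bed file that carries a strand column.)

```python
# Feature types selected in the awk command (gff column 3)
FEATURE_TYPES = {"CDS", "gene", "tRNA", "tmRNA", "ncRNA", "rRNA"}

# Complement table for A, C, G, T and N; any other character is left unchanged.
COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]


def extract_genes(gff_lines, genome):
    """Yield (header, sequence) for each selected gff feature.

    genome: dict mapping contigID -> contig sequence (assumed preloaded).
    """
    for line in gff_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip gff comment/directive lines and blank lines
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9 or cols[2] not in FEATURE_TYPES:
            continue
        contig, start, end, strand = cols[0], int(cols[3]), int(cols[4]), cols[6]
        seq = genome[contig][start - 1:end]  # 1-based inclusive -> Python slice
        if strand == "-":
            seq = reverse_complement(seq)
        yield f"{contig}:{start}-{end}({strand})", seq


# Made-up toy data
genome = {"contig1": "AAACCCGGGTTT"}
gff = [
    "##gff-version 3",
    "contig1\ttoy\tCDS\t1\t6\t.\t+\t0\tID=geneA",
    "contig1\ttoy\tCDS\t4\t12\t.\t-\t0\tID=geneB",
]
for header, seq in extract_genes(gff, genome):
    print(">" + header)
    print(seq)
```

In this sketch the headers keep the 1-based gff positions, in contrast to the 0-based start in the bedtools output.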

see also

→ gff specification