gff to ffn
How to extract gene sequences from gff files?
Using PanPhlAn
# → download the genome .fna and gene annotation .gff files from NCBI
# run panphlan to add the .fna genome sequences to the .gff files (results in folder gff_added_fna/)
python panphlan_pangenome_generation.py --i_fna ncbi_downloaded_fna_files/ --i_gff ncbi_downloaded_gff_files/
# run panphlan to extract gene sequences from gff files that have the genome sequence included
python panphlan_pangenome_generation.py --i_gff gff_added_fna/
# Result: ffn files containing the gene sequences are written to folder ffn_from_gff/
# The geneID in the ffn files contains the contigID and the start and stop positions (1-based, as in gff), followed by the locus tag and the functional description (product info from gff)
>Filename:contigID:start-stop gff-locus-tag-ID gff-gene-product
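To make the header layout concrete, here is a minimal sketch that splits such headers into columns. The function name `ffn_header_table` is hypothetical and not part of PanPhlAn. It assumes that headers follow the pattern shown above exactly and that contig IDs contain no ":". Because the positions are 1-based and inclusive, the gene length is stop - start + 1.

```shell
# Sketch: print geneID, contigID, start, stop and gene length (tab-separated)
# for every header line of a PanPhlAn-style .ffn file.
# Assumption: header layout ">Filename:contigID:start-stop locus-tag product",
# with no ":" inside the contigID.
ffn_header_table() {
    awk 'BEGIN {OFS="\t"}
         /^>/ {
             id = substr($1, 2)          # first word without the leading ">"
             n = split(id, part, ":")    # Filename : contigID : start-stop
             split(part[n], pos, "-")    # start - stop
             print id, part[2], pos[1], pos[2], pos[2] - pos[1] + 1
         }' "$1"
}

# Usage (the file name is only an example):
#   ffn_header_table ffn_from_gff/Ecoli.ffn
```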
Using BED-tools
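Requires a gene annotation .gff file and a separate genome .fna file.
# 1) extract gene locations from the .gff file (saved as a bed file)
# considering the feature types (gff column 3) "CDS", "gene", "tRNA", ...
# (a locus annotated as both "gene" and "CDS" with the same coordinates is written twice)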
awk '($3=="CDS" || $3=="gene" || $3=="tRNA" || $3=="tmRNA" || $3=="ncRNA" || $3=="rRNA") {OFS="\t"; print $1,$4-1,$5}' Ecoli.gff > Ecoli.bed
# 2) use bedtools to extract the gene sequences from the genome sequence .fna file
# Result: gene sequences are saved as a multi-FASTA .ffn file
bedtools getfasta -fi Ecoli.fna -bed Ecoli.bed -fo Ecoli.ffn
Note1: bedtools uses 0-based start positions (the 1-based gff start is converted by $4-1). The location string in the geneIDs of the .ffn file is therefore also 0-based.
Note2: This awk & bedtools approach does not return the reverse complement for genes located on the reverse strand. All gene sequences are given in forward-strand orientation.
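If strand-aware sequences are needed, one possible variation is to keep the strand (gff column 7) in a 6-column BED file and pass the `-s` option to `bedtools getfasta`, which reverse-complements features on the minus strand. The following is a sketch, not part of the original recipe. The function name `gff_to_bed6` and the output file names are hypothetical, and the gff file is assumed to be tab-separated with the strand in column 7.

```shell
# Sketch: convert gff features to a 6-column BED file
# (chrom, 0-based start, end, name, score, strand).
# Assumption: the feature type (gff column 3) is used as the BED name,
# and the score column is set to 0 as a placeholder.
gff_to_bed6() {
    awk -F'\t' 'BEGIN {OFS="\t"}
        ($3=="CDS" || $3=="gene" || $3=="tRNA" || $3=="tmRNA" || $3=="ncRNA" || $3=="rRNA") {
            print $1, $4-1, $5, $3, 0, $7
        }' "$1"
}

# Usage (input file names as in the example above):
#   gff_to_bed6 Ecoli.gff > Ecoli.stranded.bed
#   bedtools getfasta -s -fi Ecoli.fna -bed Ecoli.stranded.bed -fo Ecoli.stranded.ffn
```

For example, a CDS at gff positions 10 to 30 on the minus strand becomes the BED interval 9 to 30 with strand "-", which covers 21 bases.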
Using cufflinks gffread
see also