gff to ffn
How to extract gene sequences from gff files?
# Run panphlan to add the .fna genome sequences to the .gff files (results are written to folder gff_added_fna/)
# Run panphlan to extract gene sequences from gff files that include the genome sequence
# Result: ffn files containing the gene sequences are written to folder ffn_from_gff/
# Each geneID in the ffn files contains the contigID, the start and stop positions (1-based, as in the gff) and the functional description (product info from the gff):
>Filename:contigID:start-stop gff-locus-tag-ID gff-gene-product
The awk & bedtools approach requires a gene annotation .gff file and a separate genome .fna file.
# 1) Extract the gene locations from the .gff file (saved as a bed file)
# considering all feature types ("CDS", "gene", "tRNA", ...)
# 2) Use bedtools to extract the gene sequences from the genome sequence .fna file
# Result: gene sequences are saved as a multi-FASTA .ffn file
bedtools getfasta -fi Ecoli.fna -bed Ecoli.bed -fo Ecoli.ffn
Note1: bedtools uses 0-based start positions (the 1-based gff start positions are corrected by $4-1). The location string in the geneIDs (.ffn file) is therefore also 0-based.
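The coordinate correction in Note1 can be sketched as follows. This is a minimal example rather than the exact command used by PanPhlAn: the function name gff_to_bed is hypothetical, and it assumes a tab-separated GFF3 file in which any embedded genome sequence follows a "##FASTA" line. Only the start position changes, because BED end positions are exclusive and therefore equal the inclusive 1-based gff end.

```shell
# Hypothetical helper (not part of PanPhlAn): convert gff feature lines to 3-column BED.
# gff columns used: $1 = contig ID, $4 = start (1-based), $5 = end (1-based, inclusive)
# BED columns written: contig ID, start (0-based), end (exclusive)
gff_to_bed() {
    awk -F'\t' 'BEGIN { OFS = "\t" }
        /^##FASTA/ { exit }                      # stop before an embedded genome sequence
        /^#/       { next }                      # skip header and comment lines
        NF >= 9    { print $1, $4 - 1, $5 }      # all feature types are kept; only the start shifts
    ' "$1"
}

# Usage (file names as in the bedtools example above):
#   gff_to_bed Ecoli.gff > Ecoli.bed
```

A gene annotated at positions 10 to 30 in the gff is written as 9 and 30 in the bed file, which is why the location string in the .ffn geneIDs differs by one from the gff start.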
Note2: This awk & bedtools approach does not return the reverse complement for genes located on the reverse strand. All gene sequences are given in forward-strand orientation.
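If strand-aware sequences are needed, one possible workaround is to keep the strand in the bed file and use the -s option of bedtools getfasta, which reverse complements features on the minus strand. The sketch below is an assumption and not part of the original workflow: the function name gff_to_bed6 and the output file names are hypothetical, and the gff feature type is used as a placeholder for the BED name column.

```shell
# Hypothetical helper (not part of PanPhlAn): convert gff feature lines to 6-column BED.
# BED6 columns written: contig ID, start (0-based), end, name, score, strand
# gff columns used: $1 = contig ID, $3 = feature type, $4 = start, $5 = end, $7 = strand
gff_to_bed6() {
    awk -F'\t' 'BEGIN { OFS = "\t" }
        /^##FASTA/ { exit }                                  # stop before an embedded genome sequence
        /^#/       { next }                                  # skip header and comment lines
        NF >= 9    { print $1, $4 - 1, $5, $3, ".", $7 }     # strand goes into column 6
    ' "$1"
}

# Usage (requires bedtools; output file names are examples only):
#   gff_to_bed6 Ecoli.gff > Ecoli.stranded.bed
#   bedtools getfasta -s -fi Ecoli.fna -bed Ecoli.stranded.bed -fo Ecoli.stranded.ffn
```

With -s, bedtools reads the strand from the sixth BED column, so the 3-column bed file from the approach above is not sufficient for this variant.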
Using cufflinks gffread