NCBI ftp genome download

How to download all reference genomes of a selected species from NCBI (Ubuntu/Linux)

1) Download list of all available reference genomes

Download the complete list of manually reviewed genomes (RefSeq database, a subset of GenBank):

wget ftp://ftp.ncbi.nlm.nih.gov/genomes/ASSEMBLY_REPORTS/assembly_summary_refseq.txt

Or download the list of all available genomes (GenBank), which may include low-quality genomes:

wget ftp://ftp.ncbi.nlm.nih.gov/genomes/ASSEMBLY_REPORTS/assembly_summary_genbank.txt

→ read more at NCBI
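
Before searching, it can help to see which column number belongs to which field. The sketch below prints the numbered column names of a downloaded summary file. It assumes that the column names sit on the last comment line (starting with '#') before the data and are tab-separated; the function name list_columns is only illustrative.

```shell
# Print the column names of an assembly summary file, one per line,
# each prefixed with its column number.
# Assumption: the column names are on the last '#' line before the data.
# $1 = assembly summary file
list_columns() {
    awk '
        /^#/ { header = $0; next }   # remember the most recent comment line
        { exit }                     # stop at the first data line
        END {
            sub(/^#[ \t]*/, "", header)        # drop the leading "#"
            n = split(header, name, "\t")
            for (i = 1; i <= n; i++) print i ": " name[i]
        }' "$1"
}

# Usage (after the download in step 1):
# list_columns assembly_summary_refseq.txt
```

The numbers printed this way are the ones to pass to cut -f in the next steps.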

2) Search for available genomes of a species

Example: Eubacterium rectale (RefSeq database; columns 8, 9, 14, 15 and 16 hold the organism name, infraspecific name, genome representation, release date and assembly name)

grep -E 'Eubacterium.*rectale' assembly_summary_refseq.txt | cut -f 8,9,14,15,16

[Eubacterium rectale] ATCC 33656 strain=ATCC 33656 Full 2009/06/04 ASM2060v1

[Eubacterium] rectale strain=2789STDY5608860 Full 2015/10/02 13414_6#44

[Eubacterium] rectale strain=2789STDY5834884 Full 2015/10/02 14207_7#7

[Eubacterium] rectale strain=2789STDY5834968 Full 2015/10/02 14207_7#91

[Eubacterium] rectale strain=T1-815 Full 2015/10/08 T1815
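
The grep above matches the pattern anywhere in a line, so a hit in another column (for example the submitter) would also be reported. A possible alternative, sketched here with awk, tests only the organism name in column 8. The function name search_species is hypothetical.

```shell
# Search only column 8 (organism name) of an assembly summary file and
# print columns 8, 9, 14, 15 and 16 of the matching rows, tab-separated.
# $1 = regular expression, $2 = assembly summary file
search_species() {
    awk -F '\t' -v pattern="$1" '
        BEGIN { OFS = "\t" }
        $8 ~ pattern { print $8, $9, $14, $15, $16 }
    ' "$2"
}

# Usage:
# search_species 'Eubacterium.*rectale' assembly_summary_refseq.txt
```

Because the pattern is passed with -v, awk interprets backslash escapes in it, so patterns containing backslashes may need doubling.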

3) Get FTP download link

# for selected genomes (Eubacterium rectale), get NCBI ftp download folder (column 20)

grep -E 'Eubacterium.*rectale' assembly_summary_refseq.txt | cut -f 20 > ftp_folder.txt

head ftp_folder.txt

ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/020/605/GCF_000020605.1_ASM2060v1

ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/001/404/855/GCF_001404855.1_13414_6_44

ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/001/405/295/GCF_001405295.1_14207_7_7

ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/001/406/375/GCF_001406375.1_14207_7_91

ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/001/406/835/GCF_001406835.1_T1815

# extend each download folder to an exact download link for the genome file (fna or gff)

awk 'BEGIN{FS=OFS="/";filesuffix="genomic.fna.gz"}{ftpdir=$0;asm=$10;file=asm"_"filesuffix;print "wget "ftpdir,file}' ftp_folder.txt > download_fna_files.sh

awk 'BEGIN{FS=OFS="/";filesuffix="genomic.gff.gz"}{ftpdir=$0;asm=$10;file=asm"_"filesuffix;print "wget "ftpdir,file}' ftp_folder.txt > download_gff_files.sh

head download_fna_files.sh

wget ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/020/605/GCF_000020605.1_ASM2060v1/GCF_000020605.1_ASM2060v1_genomic.fna.gz

wget ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/001/404/855/GCF_001404855.1_13414_6_44/GCF_001404855.1_13414_6_44_genomic.fna.gz

wget ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/001/405/295/GCF_001405295.1_14207_7_7/GCF_001405295.1_14207_7_7_genomic.fna.gz

wget ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/001/406/375/GCF_001406375.1_14207_7_91/GCF_001406375.1_14207_7_91_genomic.fna.gz

wget ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/001/406/835/GCF_001406835.1_T1815/GCF_001406835.1_T1815_genomic.fna.gz
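
The awk commands above take the assembly name from field 10, which relies on the folder path always having the same depth. As a sketch of a variant that does not depend on the depth, the functions below use the last path component instead. The names make_url and make_download_script are illustrative, and skipping lines that contain only "na" is a precaution for rows without a folder path, which is an assumption about the summary file.

```shell
# Build the download URL for one assembly folder.
# $1 = folder URL (column 20), $2 = file suffix, e.g. genomic.fna.gz
make_url() {
    dir=${1%/}                                     # drop a trailing slash, if any
    printf '%s/%s_%s\n' "$dir" "${dir##*/}" "$2"   # ${dir##*/} = last path component
}

# Print one wget command per folder URL in a list file.
# $1 = file with one folder URL per line, $2 = file suffix
make_download_script() {
    while IFS= read -r folder; do
        case "$folder" in
            ''|na) continue ;;   # skip empty lines and missing paths (assumption)
        esac
        printf 'wget %s\n' "$(make_url "$folder" "$2")"
    done < "$1"
}

# Usage:
# make_download_script ftp_folder.txt genomic.fna.gz > download_fna_files.sh
```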

4) Run download

download the .fna genome files (fasta format)

source download_fna_files.sh

ls

GCF_000020605.1_ASM2060v1_genomic.fna.gz

GCF_001404855.1_13414_6_44_genomic.fna.gz

GCF_001405295.1_14207_7_7_genomic.fna.gz

GCF_001406375.1_14207_7_91_genomic.fna.gz

GCF_001406835.1_T1815_genomic.fna.gz
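
Before decompressing, the downloads can be checked for transfer errors. Each assembly folder also contains a file md5checksums.txt. The sketch below compares one downloaded file against it, assuming each line of that file has the form "<md5>  ./<file name>" and that md5sum is installed (as on Ubuntu). The function name verify_md5 is hypothetical.

```shell
# Check one downloaded file against the md5checksums.txt of its folder.
# $1 = downloaded file, $2 = md5checksums.txt from the same assembly folder
# Returns 0 if the checksums match, non-zero otherwise.
verify_md5() {
    # expected checksum: first field of the line whose second field is ./<file name>
    expected=$(awk -v f="./$(basename "$1")" '$2 == f { print $1 }' "$2")
    # actual checksum of the local file
    actual=$(md5sum "$1" | cut -d ' ' -f 1)
    [ -n "$expected" ] && [ "$expected" = "$actual" ]
}

# Usage:
# wget ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/020/605/GCF_000020605.1_ASM2060v1/md5checksums.txt
# verify_md5 GCF_000020605.1_ASM2060v1_genomic.fna.gz md5checksums.txt && echo OK
```

Since every folder has its own md5checksums.txt, the file has to be fetched (and renamed or checked) once per genome.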

# decompress genome files

gzip -d *.gz

ls

GCF_000020605.1_ASM2060v1_genomic.fna

GCF_001404855.1_13414_6_44_genomic.fna

GCF_001405295.1_14207_7_7_genomic.fna

GCF_001406375.1_14207_7_91_genomic.fna

GCF_001406835.1_T1815_genomic.fna

# get the description (first line) of each genome .fna file (more metadata is in the file assembly_summary_refseq.txt)

head -1 *.fna

==> GCF_000020605.1_ASM2060v1_genomic.fna <==

>NC_012781.1 Eubacterium rectale ATCC 33656, complete genome

==> GCF_001404855.1_13414_6_44_genomic.fna <==

>NZ_CYYW01000001.1 [Eubacterium] rectale strain 2789STDY5608860, whole genome shotgun sequence

==> GCF_001405295.1_14207_7_7_genomic.fna <==

>NZ_CZAJ01000001.1 [Eubacterium] rectale strain 2789STDY5834884, whole genome shotgun sequence

==> GCF_001406375.1_14207_7_91_genomic.fna <==

>NZ_CYXM01000001.1 [Eubacterium] rectale strain 2789STDY5834968, whole genome shotgun sequence

==> GCF_001406835.1_T1815_genomic.fna <==

>NZ_CVRQ01000001.1 [Eubacterium] rectale strain T1-815 genome assembly, contig: T1815_10, whole genome shotgun sequence
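
The first line only describes the first sequence of a file, and a "whole genome shotgun" assembly usually consists of many contigs. As a small sketch, the number of sequences per file can be obtained by counting the FASTA header lines (lines starting with '>'). The function name count_sequences is illustrative.

```shell
# Print each file name and the number of FASTA sequences it contains,
# separated by a tab. A sequence is counted per line starting with '>'.
count_sequences() {
    for f in "$@"; do
        printf '%s\t%s\n' "$f" "$(grep -c '^>' "$f")"
    done
}

# Usage:
# count_sequences *.fna
```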

Alternatively: manual download

Manually download the genome .fna files from the NCBI website:

https://ftp.ncbi.nlm.nih.gov/genomes/refseq/

ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/

NCBI file formats

Sequences

fna - genome sequence, as one or more contig nucleotide sequences (FASTA format)

ffn - gene sequences (multi-FASTA format), no longer available

faa - protein amino-acid sequences (FASTA format)

Annotations

gff - gene annotations (location, function, ...), gff from NCBI does not include sequence

gbff - gene annotations and sequence (genbank format)

gpff - protein annotations and sequence (genbank format)

https://www.ncbi.nlm.nih.gov/genome/doc/ftpfaq/#files
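
To download one of these formats instead of fna, only the file suffix appended to the assembly name changes. The mapping below is a sketch of the suffixes as they appear in the assembly folders; the function name format_suffix is hypothetical, and it is worth confirming the names against a folder listing before a bulk download.

```shell
# Map a format abbreviation to the file suffix used in the assembly folders.
# Prints the suffix, or returns 1 for an unknown format.
format_suffix() {
    case "$1" in
        fna)  echo genomic.fna.gz ;;    # genome sequence
        gff)  echo genomic.gff.gz ;;    # gene annotations
        gbff) echo genomic.gbff.gz ;;   # gene annotations and sequence
        faa)  echo protein.faa.gz ;;    # protein sequences
        gpff) echo protein.gpff.gz ;;   # protein annotations and sequence
        *)    return 1 ;;
    esac
}

# Usage: the suffix replaces "genomic.fna.gz" in the awk command of step 3
# format_suffix faa
```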

see also

→ NCBI FTP FAQ

→ Strain-level metagenomics (PanPhlAn) genome download