NCBI ftp genome download
How to download all reference genomes of a selected species from NCBI
Ubuntu / Linux command line terminal
1) Download list of all available reference genomes
download complete list of manually reviewed genomes (RefSeq database which is a subset of GenBank)
rsync -t -v rsync://ftp.ncbi.nlm.nih.gov/genomes/ASSEMBLY_REPORTS/assembly_summary_refseq.txt ./
or, download the list of all available genomes (GenBank), which may include low-quality genomes
rsync -t -v rsync://ftp.ncbi.nlm.nih.gov/genomes/ASSEMBLY_REPORTS/assembly_summary_genbank.txt ./
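Before selecting columns by number, it can help to see which number belongs to which field. The following is a minimal sketch, assuming the column names sit on the last '#' comment line before the first data row; the function name list_columns is hypothetical and not part of any NCBI tooling.

```shell
# list_columns SUMMARY_FILE
# Print "number<TAB>name" for every column named in the header line.
# Assumption: the column names are on the last '#' line before the data rows.
list_columns() {
    awk '
        /^#/ { header = $0; next }   # remember the most recent comment line
        { exit }                      # first data row reached: jump to END
        END {
            sub(/^#[ \t]*/, "", header)          # strip the leading "#"
            n = split(header, name, "\t")
            for (i = 1; i <= n; i++) printf "%d\t%s\n", i, name[i]
        }
    ' "$1"
}

# Usage (after step 1):
#   list_columns assembly_summary_refseq.txt
```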
2) Search for available genomes of a species
Example: Eubacterium limosum (RefSeq database; columns 8, 9, 14, 15, 16 hold organism name, strain, genome representation, release date, and assembly name)
grep -E 'Eubacterium.*limosum' assembly_summary_refseq.txt | cut -f 8,9,14,15,16
Eubacterium limosum strain=ATCC 8486 Full 2017/04/03 ASM80767v2
Eubacterium limosum strain=SA11 Full 2015/12/23 ASM148172v1
Eubacterium limosum strain=8486cho Full 2018/05/31 ASM318251v1
Eubacterium limosum strain=DFI.6.107 Full 2021/10/25 ASM2055962v1
Eubacterium limosum strain=B2 Full 2022/05/23 ASM2352075v1
Eubacterium limosum Full 2019/02/19 ASM90068377v1
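The grep pattern above matches anywhere on a line, so it would also keep rows where the species name shows up in a different column. A stricter variant can test column 8 only. This is a sketch under the column layout used above; the function name find_species is hypothetical.

```shell
# find_species SUMMARY_FILE SPECIES_NAME
# Print columns 8,9,14,15,16 for rows whose organism name (column 8)
# begins with SPECIES_NAME. Comment lines starting with '#' are skipped.
find_species() {
    awk -v sp="$2" '
        BEGIN { FS = OFS = "\t" }
        /^#/ { next }                  # skip header/comment lines
        index($8, sp) == 1 {           # column 8 starts with the species name
            print $8, $9, $14, $15, $16
        }
    ' "$1"
}

# Usage:
#   find_species assembly_summary_refseq.txt 'Eubacterium limosum'
```

Matching on the start of column 8 keeps entries whose organism name carries a strain suffix, while ignoring hits in other fields.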
3) Get FTP download link
# for the selected genomes (Eubacterium limosum), get the NCBI ftp download folder (column 20)
grep -E 'Eubacterium.*limosum' assembly_summary_refseq.txt | cut -f 20 > ftp_links.txt
head ftp_links.txt
https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/807/675/GCF_000807675.2_ASM80767v2
https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/001/481/725/GCF_001481725.1_ASM148172v1
https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/003/182/515/GCF_003182515.1_ASM318251v1
https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/020/559/625/GCF_020559625.1_ASM2055962v1
https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/023/520/755/GCF_023520755.1_ASM2352075v1
https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/900/683/775/GCF_900683775.1_ASM90068377v1
# extend each folder link to the exact genome file (fna or gff) and write one rsync command per line
awk 'BEGIN{FS=OFS="/";filesuffix="genomic.fna.gz"}{ftpdir=$0;asm=$10;file=asm"_"filesuffix;print "rsync -t -v "ftpdir,file" ./"}' ftp_links.txt | sed 's/https/rsync/g' > download_fna_files.sh
awk 'BEGIN{FS=OFS="/";filesuffix="genomic.gff.gz"}{ftpdir=$0;asm=$10;file=asm"_"filesuffix;print "rsync -t -v "ftpdir,file" ./"}' ftp_links.txt | sed 's/https/rsync/g' > download_gff_files.sh
head download_fna_files.sh
rsync -t -v rsync://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/807/675/GCF_000807675.2_ASM80767v2/GCF_000807675.2_ASM80767v2_genomic.fna.gz ./
rsync -t -v rsync://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/001/481/725/GCF_001481725.1_ASM148172v1/GCF_001481725.1_ASM148172v1_genomic.fna.gz ./
rsync -t -v rsync://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/003/182/515/GCF_003182515.1_ASM318251v1/GCF_003182515.1_ASM318251v1_genomic.fna.gz ./
rsync -t -v rsync://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/020/559/625/GCF_020559625.1_ASM2055962v1/GCF_020559625.1_ASM2055962v1_genomic.fna.gz ./
rsync -t -v rsync://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/023/520/755/GCF_023520755.1_ASM2352075v1/GCF_023520755.1_ASM2352075v1_genomic.fna.gz ./
rsync -t -v rsync://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/900/683/775/GCF_900683775.1_ASM90068377v1/GCF_900683775.1_ASM90068377v1_genomic.fna.gz ./
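The same link construction can be written as a small shell function, which avoids depending on the folder name being exactly the 10th "/"-separated field. This is only a sketch: make_rsync_url is a hypothetical helper name, and the ftp:// branch is an assumption for lists that use that scheme instead of https://.

```shell
# make_rsync_url FOLDER_LINK SUFFIX
# Turn an assembly folder link (column 20) into the rsync URL of one file.
# SUFFIX is e.g. genomic.fna.gz or genomic.gff.gz.
# The body runs in a subshell "( ... )" so its variables do not leak.
make_rsync_url() (
    dir=${1%/}                 # drop a trailing slash, if any
    asm=${dir##*/}             # last path component = assembly folder name
    case $dir in
        https://*) dir="rsync://${dir#https://}" ;;
        ftp://*)   dir="rsync://${dir#ftp://}" ;;   # assumed alternative scheme
    esac
    printf '%s/%s_%s\n' "$dir" "$asm" "$2"
)

# Usage: download every .fna file listed in ftp_links.txt
#   while read -r link; do
#       rsync -t -v "$(make_rsync_url "$link" genomic.fna.gz)" ./
#   done < ftp_links.txt
```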
4) Run download
download the .fna genome files (fasta format)
source download_fna_files.sh
ls
download_fna_files.sh
GCF_000807675.2_ASM80767v2_genomic.fna.gz
GCF_001481725.1_ASM148172v1_genomic.fna.gz
GCF_003182515.1_ASM318251v1_genomic.fna.gz
GCF_020559625.1_ASM2055962v1_genomic.fna.gz
GCF_023520755.1_ASM2352075v1_genomic.fna.gz
GCF_900683775.1_ASM90068377v1_genomic.fna.gz
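# print the description (first header line) of each genome .fna file (more metadata is in assembly_summary_refseq.txt)
# the file name is passed to sh as "$1" so that unusual characters in names cannot break the command
find . -name "*.fna.gz" -exec sh -c 'printf "%s: " "$1"; zcat "$1" | head -n 1' sh {} \;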
./GCF_000807675.2_ASM80767v2_genomic.fna.gz: >NZ_CP019962.1 Eubacterium limosum strain ATCC 8486 chromosome, complete genome
./GCF_001481725.1_ASM148172v1_genomic.fna.gz: >NZ_CP011914.1 Eubacterium limosum strain SA11 chromosome, complete genome
./GCF_003182515.1_ASM318251v1_genomic.fna.gz: >NZ_QGUD01000001.1 Eubacterium limosum strain 8486cho Ga0206405_101, whole genome shotgun sequence
./GCF_020559625.1_ASM2055962v1_genomic.fna.gz: >NZ_JAJCLO010000001.1 Eubacterium limosum strain DFI.6.107 IMADOJIF_1, whole genome shotgun sequence
./GCF_023520755.1_ASM2352075v1_genomic.fna.gz: >NZ_CP097376.1 Eubacterium limosum strain B2 chromosome, complete genome
./GCF_900683775.1_ASM90068377v1_genomic.fna.gz: >NZ_LR215983.1 Eubacterium limosum isolate Eubacterium limosum 81C1 chromosome 1
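The header lines above show that some assemblies are complete chromosomes while others are whole genome shotgun assemblies made of many contigs. Counting the FASTA records per file makes this visible. This is a minimal sketch; count_sequences is a hypothetical helper name.

```shell
# count_sequences FILE.fna.gz
# Print the number of FASTA records, i.e. lines starting with '>'.
count_sequences() {
    gzip -dc "$1" | grep -c '^>'
}

# Usage: one line per downloaded genome, "file<TAB>number of sequences"
#   for f in *.fna.gz; do
#       printf '%s\t%s\n' "$f" "$(count_sequences "$f")"
#   done
```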
Alternative: manual ftp download
Manually download genome .fna files from the NCBI website:
https://ftp.ncbi.nlm.nih.gov/genomes/refseq/
ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/
NCBI file formats
Sequences
fna - genome sequence, as a single or multiple contig nucleotide sequences (fasta format)
ffn - gene sequences (multifasta format), no longer available
faa - protein amino-acid sequences (fasta format)
Annotations
gff - gene annotations (location, function, ...), gff from NCBI does not include sequence
gbff - gene annotations and sequence (genbank format)
gpff - protein annotations and sequence (genbank format)
https://www.ncbi.nlm.nih.gov/genome/doc/ftpfaq/#files