Ubuntu / Linux command line terminal
Install the NCBI download tool (via conda)
conda create -n NCBI-tools -c bioconda -c conda-forge ncbi-datasets-cli
Download the human reference genome, GRCh38 (NCBI accession GCF_000001405)
conda activate NCBI-tools
datasets download genome accession GCF_000001405 --filename human_GRCh38_dataset.zip
conda deactivate
Command-specific help
datasets download --help
datasets download genome --help
datasets download genome accession --help
Download data types
genome, gene, taxonomy, virus
Available Commands (genome)
genome accession Download a genome data package by Assembly or BioProject accession
genome taxon Download a genome data package by taxon (NCBI Taxonomy ID, scientific or common name at any tax rank)
Example Commands (genome)
datasets download genome accession GCF_000001405.40 --chromosomes X,Y --include genome,gff3,rna
datasets download genome taxon "Eubacterium limosum"
Download the complete list of manually reviewed genomes (the RefSeq database, which is a subset of GenBank)
wget https://ftp.ncbi.nlm.nih.gov/genomes/refseq/assembly_summary_refseq.txt
Or download the list of all available genomes (GenBank), which may include low-quality genomes
wget https://ftp.ncbi.nlm.nih.gov/genomes/genbank/assembly_summary_genbank.txt
Example: Eubacterium limosum (RefSeq database; columns 8, 9, 14, 15, 16 hold organism name, strain, genome representation, release date, and assembly name)
grep -E 'Eubacterium.*limosum' assembly_summary_refseq.txt | cut -f 8,9,14,15,16
Eubacterium limosum strain=ATCC 8486 Full 2017/04/03 ASM80767v2
Eubacterium limosum strain=SA11 Full 2015/12/23 ASM148172v1
Eubacterium limosum strain=8486cho Full 2018/05/31 ASM318251v1
Eubacterium limosum strain=DFI.6.107 Full 2021/10/25 ASM2055962v1
Eubacterium limosum strain=B2 Full 2022/05/23 ASM2352075v1
Eubacterium limosum Full 2019/02/19 ASM90068377v1
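The grep above matches the pattern anywhere on the line, so it can also pick up hits in other columns. A minimal sketch of a stricter filter, matching only the organism name (column 8) and optionally the assembly level, is shown here. The function name select_assemblies is hypothetical, and the column numbers (12 = assembly level) are an assumption based on the header line of assembly_summary_refseq.txt; check them against your copy of the file.

```shell
# Hypothetical helper: list assemblies whose organism name (column 8)
# starts with the given species name.
#   $1 = assembly summary file (tab-separated)
#   $2 = species name, e.g. "Eubacterium limosum"
#   $3 = optional assembly level (column 12), e.g. "Complete Genome"
# Prints: accession, organism, strain, assembly level, release date, assembly name.
select_assemblies() {
    summary_file=$1
    species=$2
    level=${3:-}
    awk -F '\t' -v sp="$species" -v lvl="$level" '
        BEGIN { OFS = "\t" }
        /^#/ { next }                      # skip header/comment lines
        index($8, sp) == 1 && (lvl == "" || $12 == lvl) {
            print $1, $8, $9, $12, $15, $16
        }
    ' "$summary_file"
}
```

For example, select_assemblies assembly_summary_refseq.txt "Eubacterium limosum" "Complete Genome" would keep only the finished genomes and drop draft (contig or scaffold level) assemblies.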
# For the selected genomes (Eubacterium limosum), get the NCBI FTP download folder (column 20)
grep -E 'Eubacterium.*limosum' assembly_summary_refseq.txt | cut -f 20 > ftp_links.txt
head ftp_links.txt
https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/807/675/GCF_000807675.2_ASM80767v2
https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/001/481/725/GCF_001481725.1_ASM148172v1
https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/003/182/515/GCF_003182515.1_ASM318251v1
https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/020/559/625/GCF_020559625.1_ASM2055962v1
https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/023/520/755/GCF_023520755.1_ASM2352075v1
https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/900/683/775/GCF_900683775.1_ASM90068377v1
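The folder names above start with the assembly accession, so the same list can also feed the datasets tool instead of wget. The following is a sketch under that assumption; the function name links_to_accessions is hypothetical, and lines that do not look like an NCBI assembly folder are silently skipped.

```shell
# Hypothetical helper: read FTP folder links on stdin and print the
# assembly accession (e.g. GCF_000807675.2) taken from the last path element.
links_to_accessions() {
    sed -n -E 's#.*/(GC[AF]_[0-9]+\.[0-9]+)_[^/]*$#\1#p'
}
```

A possible use: links_to_accessions < ftp_links.txt > accessions.txt, followed by datasets download genome accession --inputfile accessions.txt --include genome,gff3. The --inputfile option should be confirmed with datasets download genome accession --help for the installed version.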
# Extend each folder link into an exact genome file (fna or gff) download link
awk 'BEGIN{FS=OFS="/";filesuffix="genomic.fna.gz"}{ftpdir=$0;asm=$10;file=asm"_"filesuffix;print "wget "ftpdir,file}' ftp_links.txt > download_fna_files.sh
awk 'BEGIN{FS=OFS="/";filesuffix="genomic.gff.gz"}{ftpdir=$0;asm=$10;file=asm"_"filesuffix;print "wget "ftpdir,file}' ftp_links.txt > download_gff_files.sh
head download_fna_files.sh
wget https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/807/675/GCF_000807675.2_ASM80767v2/GCF_000807675.2_ASM80767v2_genomic.fna.gz
wget https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/001/481/725/GCF_001481725.1_ASM148172v1/GCF_001481725.1_ASM148172v1_genomic.fna.gz
wget https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/003/182/515/GCF_003182515.1_ASM318251v1/GCF_003182515.1_ASM318251v1_genomic.fna.gz
wget https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/020/559/625/GCF_020559625.1_ASM2055962v1/GCF_020559625.1_ASM2055962v1_genomic.fna.gz
wget https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/023/520/755/GCF_023520755.1_ASM2352075v1/GCF_023520755.1_ASM2352075v1_genomic.fna.gz
wget https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/900/683/775/GCF_900683775.1_ASM90068377v1/GCF_900683775.1_ASM90068377v1_genomic.fna.gz
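The awk commands above take the assembly name from a fixed field position ($10), which only works while every link has the same path depth. A sketch of a variant that uses the last path element instead and skips lines that are not links is given here. The function name build_file_urls is hypothetical, and it prints plain URLs rather than wget commands.

```shell
# Hypothetical helper: read FTP folder links on stdin, print one file URL per line.
#   $1 = file suffix, e.g. genomic.fna.gz or genomic.gff.gz
build_file_urls() {
    awk -F '/' -v suffix="$1" '
        # keep only lines that look like links and have a non-empty last element
        /^https?:\/\// && $NF != "" { print $0 "/" $NF "_" suffix }
    '
}
```

For example, build_file_urls genomic.fna.gz < ftp_links.txt > fna_urls.txt followed by wget -i fna_urls.txt should fetch the same files as download_fna_files.sh.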
Download the .fna genome files (FASTA format)
source download_fna_files.sh
ls
download_fna_files.sh
GCF_000807675.2_ASM80767v2_genomic.fna.gz
GCF_001481725.1_ASM148172v1_genomic.fna.gz
GCF_003182515.1_ASM318251v1_genomic.fna.gz
GCF_020559625.1_ASM2055962v1_genomic.fna.gz
GCF_023520755.1_ASM2352075v1_genomic.fna.gz
GCF_900683775.1_ASM90068377v1_genomic.fna.gz
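Interrupted downloads can leave truncated .gz files behind. A minimal sketch of an integrity check with gzip -t is shown here; the function name check_gz_files is hypothetical.

```shell
# Hypothetical helper: test every .gz file in a directory (default: current one)
# and report the files that fail the gzip integrity check.
# Returns 0 if all files are fine, 1 if at least one is corrupt.
check_gz_files() {
    dir=${1:-.}
    rc=0
    for f in "$dir"/*.gz; do
        [ -e "$f" ] || continue            # no .gz files in the directory
        if ! gzip -t "$f" 2>/dev/null; then
            echo "CORRUPT: $f"
            rc=1
        fi
    done
    return "$rc"
}
```

Running check_gz_files . after the download prints nothing when every archive is intact; any file that is listed should be downloaded again.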
# get description (top line) of genome .fna files (more metadata are in file assembly_summary_refseq.txt)
find . -name "*.fna.gz" -exec sh -c 'printf "%s: " "$1"; zcat "$1" | head -n 1' sh {} \;
./GCF_000807675.2_ASM80767v2_genomic.fna.gz: >NZ_CP019962.1 Eubacterium limosum strain ATCC 8486 chromosome, complete genome
./GCF_001481725.1_ASM148172v1_genomic.fna.gz: >NZ_CP011914.1 Eubacterium limosum strain SA11 chromosome, complete genome
./GCF_003182515.1_ASM318251v1_genomic.fna.gz: >NZ_QGUD01000001.1 Eubacterium limosum strain 8486cho Ga0206405_101, whole genome shotgun sequence
./GCF_020559625.1_ASM2055962v1_genomic.fna.gz: >NZ_JAJCLO010000001.1 Eubacterium limosum strain DFI.6.107 IMADOJIF_1, whole genome shotgun sequence
./GCF_023520755.1_ASM2352075v1_genomic.fna.gz: >NZ_CP097376.1 Eubacterium limosum strain B2 chromosome, complete genome
./GCF_900683775.1_ASM90068377v1_genomic.fna.gz: >NZ_LR215983.1 Eubacterium limosum isolate Eubacterium limosum 81C1 chromosome 1
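The first header line alone does not show whether a file holds one finished chromosome or many contigs (as in the "whole genome shotgun sequence" entries above). A sketch that counts sequences and bases per file is given here; the function name fasta_stats is hypothetical.

```shell
# Hypothetical helper: print the number of sequences and the total number
# of bases in a gzipped FASTA file, separated by a tab.
fasta_stats() {
    gzip -dc "$1" | awk '
        /^>/ { n++; next }                 # header line: count one sequence
        { bases += length($0) }            # sequence line: add its length
        END { printf "%d\t%d\n", n, bases }
    '
}
```

For example, for f in *.fna.gz; do printf '%s\t' "$f"; fasta_stats "$f"; done lists each genome with its sequence count and total length.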
Manually download genome .fna files from the NCBI website:
https://ftp.ncbi.nlm.nih.gov/genomes/refseq/
Sequences
fna - genome sequence, as a single or multiple contig nucleotide sequences (FASTA format)
ffn - gene sequences (multi-FASTA format), no longer available
faa - protein amino-acid sequences (FASTA format)
Annotations
gff - gene annotations (location, function, ...); the GFF from NCBI does not include the sequence
gbff - gene annotations and sequence (genbank format)
gpff - protein annotations and sequence (genbank format)
https://www.ncbi.nlm.nih.gov/genome/doc/ftpfaq/#files