→ Install SRA-tools (fastq-dump, prefetch, ...)
Update 2018: consider using its newer replacement → fasterq-dump
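As a minimal sketch of the newer tool, assuming a version of sra-tools that ships fasterq-dump is on the PATH: the wrapper below converts one run and, for paired-end data, writes the mates to *_1.fastq and *_2.fastq. The function name fetch_fastq and its defaults (current directory, 4 threads) are illustrative choices, not part of sra-tools.

```shell
# Hypothetical wrapper (not part of sra-tools): convert one run with fasterq-dump.
# Usage: fetch_fastq ACCESSION [OUTDIR] [THREADS]
fetch_fastq() {
    local acc="$1"
    local outdir="${2:-.}"      # assumed default: current directory
    local threads="${3:-4}"     # assumed default: 4 threads

    if [ -z "$acc" ]; then
        echo "usage: fetch_fastq ACCESSION [OUTDIR] [THREADS]" >&2
        return 2
    fi

    mkdir -p "$outdir" || return 1

    # --split-3 : mates go to *_1.fastq / *_2.fastq, unpaired reads to *.fastq
    # -O        : output directory
    # -e        : number of threads
    fasterq-dump --split-3 -O "$outdir" -e "$threads" "$acc"
}
```

For example, `fetch_fastq SRR649944 fastq_out 8` would place the fastq files for that run in fastq_out/.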
fastq-dump can be used for local .sra files or for direct download from NCBI
    # local use (path to .sra file)
    fastq-dump --split-spot path/to/local/file/SRR649944.sra
    # result: SRR649944.fastq

    # direct download from NCBI/SRA (only accession number, no path)
    fastq-dump --split-3 SRR649944
    # results: SRR649944_1.fastq SRR649944_2.fastq

A copy of the .sra file is saved to a local cache/archive folder and reused by repeated fastq-dump calls without another download:

    $HOME/ncbi/public/sra/SRR649944.sra

Download only

1) fasterq-dump
2) prefetch
Alternatively, prefetch can be used to download only the .sra file, for later use by fastq-dump
    prefetch SRR649944               # stores .sra file in $HOME/ncbi/public/sra/
    fastq-dump --split-3 SRR649944   # takes file from $HOME/ncbi/public/sra/ (without downloading again)
    # results: SRR649944_1.fastq SRR649944_2.fastq

Download error
After a download error, a cache and/or lock file may need to be removed before trying again:
    rm $HOME/ncbi/public/sra/SRR649944.sra.cache
    rm $HOME/ncbi/public/sra/SRR649944.sra.cache.lock
    rm $HOME/ncbi/public/sra/SRR649944.sra.tmp.23569.tmp   # (prefetch)

http://www.ncbi.nlm.nih.gov/books/NBK158899/#SRA_download.downloading_sra_data_using

Extracting fastq files from SRA files, for paired-end reads

    fastq-dump --split-3 SAMPLE

results:
    SAMPLE_1.fastq
    SAMPLE_2.fastq
    SAMPLE.fastq    (only if .sra contains single reads / single-end sequencing)

--split-3 splits paired reads into files *_1.fastq and *_2.fastq; single reads (if any) go into *.fastq

SAMPLE can be an SRA ID (downloaded from NCBI or taken from the local ncbi/public/sra/ archive) or a direct path to a local .sra file

    fastq-dump --split-3 SRR649944
    fastq-dump --split-3 path/to/local/file/SRR649944.sra

Converting SRA files into a single fastq file

    fastq-dump --split-spot SAMPLE

results:
    SAMPLE.fastq

options:
    --split-spot    splits paired-end reads, but writes them all to a single fastq file

To use in a pipe
    fastq-dump -Z --split-spot SAMPLE | bowtie2 ...

options:
    -Z    writes sequences to standard output

Filter read length of SRA samples
    fastq-dump --minReadLen 80 --split-3 SAMPLE
    fastq-dump --minReadLen 80 --split-spot -Z SAMPLE | bowtie2 ...

options:
    --minReadLen 80    extracts only reads >= 80 bp from the SRA file

read more
http://ncbi.github.io/sra-tools/fastq-dump.html
https://github.com/ncbi/sra-tools/wiki/HowTo:-Access-SRA-Data
http://www.ncbi.nlm.nih.gov/books/NBK47528/
http://www.ncbi.nlm.nih.gov/books/NBK242621/#_SRA_Download_Guid_BK_The_SRA_Toolkit_
http://www.ncbi.nlm.nih.gov/books/NBK158899/#SRA_download.downloading_sra_data_using

fastq-dump --help

    Usage:
      fastq-dump [options] <path> [<path>...]
      fastq-dump [options] <accession>

    INPUT
      -A|--accession <accession>        Replaces accession derived from <path> in
                                        filename(s) and deflines (only for single
                                        table dump)
      --table <table-name>              Table name within cSRA object, default is
                                        "SEQUENCE"

    PROCESSING

    Read Splitting                      Sequence data may be used in raw form or
                                        split into individual reads
      --split-spot                      Split spots into individual reads

    Full Spot Filters                   Applied to the full spot independently of
                                        --split-spot
      -N|--minSpotId <rowid>            Minimum spot id
      -X|--maxSpotId <rowid>            Maximum spot id
      --spot-groups <[list]>            Filter by SPOT_GROUP (member): name[,...]
      -W|--clip                         Apply left and right clips

    Common Filters                      Applied to spots when --split-spot is not
                                        set, otherwise - to individual reads
      -M|--minReadLen <len>             Filter by sequence length >= <len>
      -R|--read-filter <[filter]>       Split into files by READ_FILTER value
                                        optionally filter by value:
                                        pass|reject|criteria|redacted
      -E|--qual-filter                  Filter used in early 1000 Genomes data: no
                                        sequences starting or ending with >= 10N
      --qual-filter-1                   Filter used in current 1000 Genomes data

    Filters based on alignments         Filters are active when alignment data are
                                        present
      --aligned                         Dump only aligned sequences
      --unaligned                       Dump only unaligned sequences
      --aligned-region <name[:from-to]>
                                        Filter by position on genome. Name can
                                        either be accession.version (ex:
                                        NC_000001.10) or file specific name (ex:
                                        "chr1" or "1"). "from" and "to" are 1-based
                                        coordinates
      --matepair-distance <from-to|unknown>
                                        Filter by distance between matepairs. Use
                                        "unknown" to find matepairs split between
                                        the references. Use from-to to limit
                                        matepair distance on the same reference

    Filters for individual reads        Applied only with --split-spot set
      --skip-technical                  Dump only biological reads

    OUTPUT
      -O|--outdir <path>                Output directory, default is working
                                        directory ( '.' )
      -Z|--stdout                       Output to stdout, all split data become
                                        joined into single stream
      --gzip                            Compress output using gzip
      --bzip2                           Compress output using bzip2

    Multiple File Options               Setting these options will produce more
                                        than 1 file, each of which will be suffixed
                                        according to splitting criteria.
      --split-files                     Dump each read into separate file. Files
                                        will receive suffix corresponding to read
                                        number
      --split-3                         Legacy 3-file splitting for mate-pairs:
                                        First biological reads satisfying dumping
                                        conditions are placed in files *_1.fastq
                                        and *_2.fastq. If only one biological read
                                        is present it is placed in *.fastq.
                                        Biological reads 3 and above are ignored.
      -G|--spot-group                   Split into files by SPOT_GROUP (member name)
      -R|--read-filter <[filter]>       Split into files by READ_FILTER value
                                        optionally filter by value:
                                        pass|reject|criteria|redacted
      -T|--group-in-dirs                Split into subdirectories instead of files
      -K|--keep-empty-files             Do not delete empty files

    FORMATTING

    Sequence
      -C|--dumpcs <[cskey]>             Formats sequence using color space (default
                                        for SOLiD), "cskey" may be specified for
                                        translation
      -B|--dumpbase                     Formats sequence using base space (default
                                        for other than SOLiD).

    Quality
      -Q|--offset <integer>             Offset to use for quality conversion,
                                        default is 33
      --fasta <[line width]>            FASTA only, no qualities, optional line
                                        wrap width (set to zero for no wrapping)

    Defline
      -F|--origfmt                      Defline contains only original sequence name
      -I|--readids                      Append read id after spot id as
                                        'accession.spot.readid' on defline
      --helicos                         Helicos style defline
      --defline-seq <fmt>               Defline format specification for sequence.
      --defline-qual <fmt>              Defline format specification for quality.
                                        <fmt> is string of characters and/or
                                        variables. The variables can be one of:
                                        $ac - accession, $si - spot id,
                                        $sn - spot name, $sg - spot group (barcode),
                                        $sl - spot length in bases,
                                        $ri - read number, $rn - read name,
                                        $rl - read length in bases.
                                        '[]' could be used for an optional output:
                                        if all vars in [] yield empty values whole
                                        group is not printed. Empty value is empty
                                        string or 0 for numeric variables.
                                        Ex: @$sn[_$rn]/$ri
                                        '_$rn' is omitted if name is empty

    OTHER:
      --disable-multithreading          disable multithreading
      -h|--help                         Output brief explanation of program usage
      -V|--version                      Display the version of the program
      -L|--log-level <level>            Logging level as number or enum string.
                                        One of (fatal|sys|int|err|warn|info) or
                                        (0-5). Current/default is warn
      -v|--verbose                      Increase the verbosity level of the program.
                                        Use multiple times for more verbosity
      --ncbi_error_report               Control program execution environment report
                                        generation (if implemented). One of
                                        (never|error|always). Default is error
      --legacy-report                   use legacy style 'Written spots' for tool

    fastq-dump : 2.3.4
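Several of the options listed in the help text can be combined for a quick look at a run before converting it in full. The sketch below is one possible combination, not a recommended default: the function name preview_run and the default of 1000 spots are made up for illustration, while every flag comes from the help output above.

```shell
# Hypothetical helper (not part of sra-tools): dump only the first spots of a
# run to standard output as a quick sanity check.
# Usage: preview_run SAMPLE [N_SPOTS]
preview_run() {
    local sample="$1"
    local nspots="${2:-1000}"   # assumed default: first 1000 spots

    if [ -z "$sample" ]; then
        echo "usage: preview_run SAMPLE [N_SPOTS]" >&2
        return 2
    fi

    # -X / --maxSpotId : stop after this spot id
    # --split-spot     : split each spot into its individual reads
    # --skip-technical : keep only biological reads (needs --split-spot)
    # --defline-seq    : name reads as @<spot name>[_<read name>]/<read number>
    # -Z               : write everything to standard output
    fastq-dump -X "$nspots" --split-spot --skip-technical \
        --defline-seq '@$sn[_$rn]/$ri' -Z "$sample"
}
```

The format string is single-quoted so that the shell passes `$sn`, `$rn` and `$ri` to fastq-dump unchanged. A call such as `preview_run SRR649944 200 | head` shows the first few records.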
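The cleanup described under "Download error" can be wrapped in a small function so that it does not have to be typed per file. This is only a sketch: the name clean_sra_leftovers is hypothetical, the default directory is the cache location used throughout this page, and the temporary-file pattern is inferred from the single example shown there.

```shell
# Hypothetical helper (not part of sra-tools): remove leftover cache, lock and
# temporary files for one accession so that the download can be retried.
# Usage: clean_sra_leftovers ACCESSION [SRA_DIR]
clean_sra_leftovers() {
    local acc="$1"
    local sra_dir="${2:-$HOME/ncbi/public/sra}"

    if [ -z "$acc" ]; then
        echo "usage: clean_sra_leftovers ACCESSION [SRA_DIR]" >&2
        return 2
    fi

    # -f: stay quiet and succeed when a file is already gone.
    rm -f -- "$sra_dir/$acc.sra.cache" "$sra_dir/$acc.sra.cache.lock"

    # Assumed pattern for prefetch temp files: ACC.sra.tmp.<number>.tmp
    rm -f -- "$sra_dir/$acc".sra.tmp.*.tmp
}
```

A completed $acc.sra file is left in place, so only the interrupted download is affected. After `clean_sra_leftovers SRR649944`, prefetch or fastq-dump can be run again.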