NCBI SRA file format

→ Install SRA-tools (fastq-dump, prefetch, ...)

Converting SRA files to fastq

Update 2018: consider using the newer tool → fasterq-dump

fastq-dump can be used for local .sra files or for direct download from NCBI

# local use (path to .sra file)

fastq-dump --split-spot path/to/local/file/SRR649944.sra

SRR649944.fastq

# direct download from NCBI/SRA (only accession number, no path)

fastq-dump --split-3 SRR649944

SRR649944_1.fastq

SRR649944_2.fastq

A copy of the .sra file is saved to a local cache folder, so repeated fastq-dump calls do not download it again:

$HOME/ncbi/public/sra/SRR649944.sra
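
To check whether an accession is already in the local cache before calling fastq-dump, something like the following sketch may help. The function names and the SRA_CACHE_DIR variable are hypothetical helpers, not part of SRA-tools, and the default path is the one shown above (other toolkit versions or configurations may store the file elsewhere).

```shell
# Hypothetical helper: print the expected cache location for an accession.
# SRA_CACHE_DIR is an illustrative variable; it defaults to the path shown above.
sra_cache_path() {
    printf '%s/%s.sra\n' "${SRA_CACHE_DIR:-$HOME/ncbi/public/sra}" "$1"
}

# Hypothetical helper: succeed if a non-empty cached copy exists.
sra_is_cached() {
    [ -s "$(sra_cache_path "$1")" ]
}

# Usage:
#   if sra_is_cached SRR649944; then echo "cached"; else echo "not cached yet"; fi
```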

Download only

A) fasterq-dump

B) prefetch

Alternatively, prefetch only downloads the .sra file, for later use by fastq-dump

prefetch SRR649944 # stores .sra file in $HOME/ncbi/public/sra/

fastq-dump --split-3 SRR649944 # takes the file from $HOME/ncbi/public/sra/ (no second download)

SRR649944_1.fastq

SRR649944_2.fastq
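
For several accessions, the two steps above can be wrapped in a small loop. This is only a sketch: run and fetch_and_dump are hypothetical helper names, and DRY_RUN is an illustrative variable that prints the commands instead of executing them.

```shell
# Hypothetical helper: run a command, or only print it when DRY_RUN=1.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Hypothetical wrapper: download each accession with prefetch, then convert it.
fetch_and_dump() {
    for acc in "$@"; do
        # Skip the conversion if the download step failed.
        run prefetch "$acc" || { echo "prefetch failed: $acc" >&2; continue; }
        run fastq-dump --split-3 "$acc" || echo "fastq-dump failed: $acc" >&2
    done
}

# Usage (prints the two commands without running them):
#   DRY_RUN=1; fetch_and_dump SRR649944
```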

C) wget (not recommended)

Download error

After a download error, leftover cache and/or lock files may need to be removed before trying again:

rm $HOME/ncbi/public/sra/SRR649944.sra.cache

rm $HOME/ncbi/public/sra/SRR649944.sra.cache.lock

rm $HOME/ncbi/public/sra/SRR649944.sra.tmp.23569.tmp # prefetch

http://www.ncbi.nlm.nih.gov/books/NBK158899/#SRA_download.downloading_sra_data_using
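
The cleanup above could be collected into one function. This is a sketch under assumptions: sra_clean_stale and SRA_CACHE_DIR are hypothetical names, and the file name patterns are taken from the three rm commands shown above. The finished .sra file itself is not removed.

```shell
# Hypothetical helper: remove leftover cache, lock, and temp files for one accession.
# Runs in a subshell, so the variable "dir" does not leak into the caller.
sra_clean_stale() (
    dir="${SRA_CACHE_DIR:-$HOME/ncbi/public/sra}"
    rm -f "$dir/$1.sra.cache" "$dir/$1.sra.cache.lock"
    # prefetch temp files carry a varying number, e.g. SRR649944.sra.tmp.23569.tmp
    rm -f "$dir/$1".sra.tmp.*.tmp
)

# Usage:
#   sra_clean_stale SRR649944
```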


fastq-dump options


Extracting fastq files from SRA files, for paired-end reads

fastq-dump --split-3 SAMPLE

results:

SAMPLE_1.fastq

SAMPLE_2.fastq

SAMPLE.fastq (only if .sra contains single reads / single-end sequencing)

--split-3 splits paired reads into files *_1.fastq and *_2.fastq; single read (if any) into *.fastq

SAMPLE can be an SRA accession (downloaded from NCBI or taken from the local ncbi/public/sra/ archive) or a direct path to a local .sra file

fastq-dump --split-3 SRR649944

fastq-dump --split-3 path/to/local/file/SRR649944.sra
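
After a --split-3 run it can be useful to confirm that both mate files contain the same number of reads. A minimal sketch, assuming four lines per FASTQ record (sequence and quality not wrapped); fastq_read_count and pairs_match are hypothetical helper names.

```shell
# Hypothetical helper: number of reads in a FASTQ file (4 lines per record assumed).
fastq_read_count() {
    echo $(( $(wc -l < "$1") / 4 ))
}

# Hypothetical helper: succeed if both mate files hold the same number of reads.
pairs_match() {
    [ "$(fastq_read_count "$1")" -eq "$(fastq_read_count "$2")" ]
}

# Usage:
#   pairs_match SRR649944_1.fastq SRR649944_2.fastq && echo "mate files are in sync"
```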


Converting SRA files into a single fastq file

fastq-dump --split-spot SAMPLE

results:

SAMPLE.fastq

options:

--split-spot splits paired-end reads, but writes them all to a single fastq file


To use in a pipe

fastq-dump -Z --split-spot SAMPLE | bowtie2 ...

options:

-Z writes sequences to standard output
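
If the downstream tool expects two separate mate files rather than one stream, the piped output can be split again. The sketch below assumes that the stream contains both mates of every spot, one directly after the other, with four lines per record. This assumption breaks if a filter drops only one mate of a pair. deinterleave_fastq is a hypothetical helper name.

```shell
# Hypothetical helper: split an interleaved FASTQ stream on stdin into two files.
# Lines 1-4 of every 8-line block go to the first file, lines 5-8 to the second.
deinterleave_fastq() {
    awk -v out1="$1" -v out2="$2" '
        (NR - 1) % 8 < 4 { print > out1; next }
        { print > out2 }
    '
}

# Usage:
#   fastq-dump -Z --split-spot SAMPLE | deinterleave_fastq SAMPLE_R1.fastq SAMPLE_R2.fastq
```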


Filter read length of SRA samples

fastq-dump --minReadLen 80 --split-3 SAMPLE

fastq-dump --minReadLen 80 --split-spot -Z SAMPLE | bowtie2 ...

options:

--minReadLen 80 extracts only reads >= 80 bp from the SRA file
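
To verify the effect of the length filter on an output file, the reads shorter than the threshold can be counted. A minimal sketch, assuming four lines per FASTQ record; count_short_reads is a hypothetical helper name.

```shell
# Hypothetical helper: count reads shorter than MIN bases in a FASTQ file.
# Every 4th line starting at line 2 is a sequence line.
count_short_reads() {
    awk -v min="$2" 'NR % 4 == 2 && length($0) < min { n++ } END { print n + 0 }' "$1"
}

# Usage (a count of 0 is expected after --minReadLen 80):
#   count_short_reads SAMPLE_1.fastq 80
```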


read more

http://ncbi.github.io/sra-tools/fastq-dump.html

https://github.com/ncbi/sra-tools/wiki/HowTo:-Access-SRA-Data

http://www.ncbi.nlm.nih.gov/books/NBK47528/

http://www.ncbi.nlm.nih.gov/books/NBK242621/#_SRA_Download_Guid_BK_The_SRA_Toolkit_

http://www.ncbi.nlm.nih.gov/books/NBK158899/#SRA_download.downloading_sra_data_using



fastq-dump --help

Usage:

fastq-dump [options] <path> [<path>...]

fastq-dump [options] <accession>

INPUT

  -A|--accession <accession>   Replaces accession derived from <path> in filename(s) and deflines (only for single table dump)
  --table <table-name>         Table name within cSRA object, default is "SEQUENCE"

PROCESSING

Read Splitting: Sequence data may be used in raw form or split into individual reads

  --split-spot                 Split spots into individual reads

Full Spot Filters: Applied to the full spot independently of --split-spot

  -N|--minSpotId <rowid>       Minimum spot id
  -X|--maxSpotId <rowid>       Maximum spot id
  --spot-groups <[list]>       Filter by SPOT_GROUP (member): name[,...]
  -W|--clip                    Apply left and right clips

Common Filters: Applied to spots when --split-spot is not set, otherwise - to individual reads

  -M|--minReadLen <len>        Filter by sequence length >= <len>
  -R|--read-filter <[filter]>  Split into files by READ_FILTER value, optionally filter by value: pass|reject|criteria|redacted
  -E|--qual-filter             Filter used in early 1000 Genomes data: no sequences starting or ending with >= 10N
  --qual-filter-1              Filter used in current 1000 Genomes data

Filters based on alignments: Filters are active when alignment data are present

  --aligned                    Dump only aligned sequences
  --unaligned                  Dump only unaligned sequences
  --aligned-region <name[:from-to]>       Filter by position on genome. Name can either be accession.version (ex: NC_000001.10) or file specific name (ex: "chr1" or "1"). "from" and "to" are 1-based coordinates
  --matepair-distance <from-to|unknown>   Filter by distance between matepairs. Use "unknown" to find matepairs split between the references. Use from-to to limit matepair distance on the same reference

Filters for individual reads: Applied only with --split-spot set

  --skip-technical             Dump only biological reads

OUTPUT

  -O|--outdir <path>           Output directory, default is working directory ('.')
  -Z|--stdout                  Output to stdout, all split data become joined into single stream
  --gzip                       Compress output using gzip
  --bzip2                      Compress output using bzip2

Multiple File Options: Setting these options will produce more than 1 file, each of which will be suffixed according to splitting criteria.

  --split-files                Dump each read into separate file. Files will receive suffix corresponding to read number
  --split-3                    Legacy 3-file splitting for mate-pairs: First biological reads satisfying dumping conditions are placed in files *_1.fastq and *_2.fastq. If only one biological read is present it is placed in *.fastq. Biological reads and above are ignored.
  -G|--spot-group              Split into files by SPOT_GROUP (member name)
  -R|--read-filter <[filter]>  Split into files by READ_FILTER value, optionally filter by value: pass|reject|criteria|redacted
  -T|--group-in-dirs           Split into subdirectories instead of files
  -K|--keep-empty-files        Do not delete empty files

FORMATTING

Sequence

  -C|--dumpcs <[cskey]>        Formats sequence using color space (default for SOLiD), "cskey" may be specified for translation
  -B|--dumpbase                Formats sequence using base space (default for other than SOLiD).

Quality

  -Q|--offset <integer>        Offset to use for quality conversion, default is 33
  --fasta <[line width]>       FASTA only, no qualities, optional line wrap width (set to zero for no wrapping)

Defline

  -F|--origfmt                 Defline contains only original sequence name
  -I|--readids                 Append read id after spot id as 'accession.spot.readid' on defline
  --helicos                    Helicos style defline
  --defline-seq <fmt>          Defline format specification for sequence.
  --defline-qual <fmt>         Defline format specification for quality.

<fmt> is string of characters and/or variables. The variables can be one of: $ac - accession, $si spot id, $sn spot name, $sg spot group (barcode), $sl spot length in bases, $ri read number, $rn read name, $rl read length in bases. '[]' could be used for an optional output: if all vars in [] yield empty values whole group is not printed. Empty value is empty string or for numeric variables. Ex: @$sn[_$rn]/$ri '_$rn' is omitted if name is empty

OTHER:

  --disable-multithreading     disable multithreading
  -h|--help                    Output brief explanation of program usage
  -V|--version                 Display the version of the program
  -L|--log-level <level>       Logging level as number or enum string. One of (fatal|sys|int|err|warn|info) or (0-5). Current/default is warn
  -v|--verbose                 Increase the verbosity level of the program. Use multiple times for more verbosity
  --ncbi_error_report          Control program execution environment report generation (if implemented). One of (never|error|always). Default is error
  --legacy-report              use legacy style 'Written spots' for tool

fastq-dump : 2.3.4