SAMtools: get breadth of coverage

How to get breadth of coverage of a reference genome.

How many bases of my reference genome are covered by my short reads sample at a level of 5X or higher?

# set shell variables to your file-names

REF_GENOME_FILE=my_ref_genome.fasta

WGS_SAMPLE=my_short_read_sample.fastq

MIN_COVERAGE_DEPTH=5

1) Map short reads against reference genome (bowtie2)

# create bowtie2 index database (database name: refgenome)

bowtie2-build ${REF_GENOME_FILE} refgenome

ls

refgenome.1.bt2

refgenome.2.bt2

refgenome.3.bt2

refgenome.4.bt2

refgenome.rev.1.bt2

refgenome.rev.2.bt2


# map reads and sort bam file

bowtie2 -x refgenome --no-unal -U ${WGS_SAMPLE} -S - -p 12 | \

samtools view -bS - | \

samtools sort -m 5G - mapping_result_sorted.bam

-U ${WGS_SAMPLE} consider all reads as unpaired reads

-S - bowtie output in SAM format, write to standard out

-x refgenome use reference genome bowtie2-database

--no-unal suppress SAM records for reads that failed to align

-p 12 use 12 processors

2) Get breadth of coverage (SAMtools)

# create samtools index

samtools index mapping_result_sorted.bam

# get total number of bases covered at MIN_COVERAGE_DEPTH or higher

samtools mpileup mapping_result_sorted.bam | awk -v X="${MIN_COVERAGE_DEPTH}" '$4>=X' | wc -l

32876

# get length of reference genome

bowtie2-inspect -s refgenome | awk '{ FS = "\t" } ; BEGIN{L=0}; {L=L+$3}; END{print L}'

45678


Result: breadth of reference genome coverage

total number of covered bases: 32876 (with >= 5X coverage depth)

→ Depth of coverage (average per-base coverage): 0.719 X = 32876 ÷ 45678 (total number of covered bases divided by reference genome length)

percent: 71.9% = 0.719 × 100 ( → coverage breadth: covered genome length in percent)