In silico removal of host sequences
removing host (contamination) sequences in order to analyze remaining (bacterial) sequences
How to filter out host reads from paired-end fastq files?
Steps:
a) bowtie2 mapping against host: write all (mapped/unmapped) reads to a single .bam file
b) Samtools: use flags to extract unmapped reads
c) bedtools: split paired-end reads into separated files
a) bowtie2 mapping against host sequence
Host example: human genome hg19 (download bowtie2 hg19 index)
# 1) create bowtie2 index database (host_DB) from host reference genome
bowtie2-build host_genome.fna host_DB
# 2) bowtie2 mapping against host sequence database, keep both mapped and unmapped reads (paired-end reads)
bowtie2 -x host_DB -1 SAMPLE_r1.fastq -2 SAMPLE_r2.fastq -S SAMPLE_mapped_and_unmapped.sam
# 3) convert file .sam to .bam
samtools view -bS SAMPLE_mapped_and_unmapped.sam > SAMPLE_mapped_and_unmapped.bam
b) filter required unmapped reads
# SAMtools SAM-flag filter: get unmapped pairs (both ends unmapped)
samtools view -b -f 12 -F 256 SAMPLE_mapped_and_unmapped.bam > SAMPLE_bothEndsUnmapped.bam
-f 12 Extract only (-f ) alignments with both reads unmapped: <read unmapped><mate unmapped>
-F 256 Do not (-F ) extract alignments which are: <not primary alignment>
see meaning of SAM-flags
c) split paired-end reads into separated fastq files .._r1 .._r2
# sort bam file by read name (-n) to have paired reads next to each other as required by bedtools
samtools sort -n SAMPLE_bothEndsUnmapped.bam SAMPLE_bothEndsUnmapped_sorted
bedtools bamtofastq -i SAMPLE_bothEndsUnmapped_sorted.bam -fq SAMPLE_host_removed_r1.fastq -fq2 SAMPLE_host_removed_r2.fastq
Result
Two files of paired-end reads, containing non-host sequences
SAMPLE_host_removed_r1.fastq
SAMPLE_host_removed_r2.fastq
Notes
As alternative, bowtie2 option --un-conc can be used to directly get unmapped reads.
But SAM-flags gives more flexibility in extracting the required reads, including
- correctly mapped pairs < read mapped in proper pair>
(mapped distance as expected and correctly oriented to each other)
- keeping or rejecting pairs in which only one read is mapped
To in silico remove rRNA, see software tool → SortMeRNA
|
|