Remove host sequences
In silico removal of host sequences
removing host (contamination) sequences in order to analyze remaining (bacterial) sequences
1) Using bowtie2 option: --un-conc
quick solution to get the paired reads that do not map to the host reference genome (both reads unmapped).
# download ready to use bowtie2 database of human host genome GRCh38 (hg38)
# bowtie2 results (gz files without gz ending)
# rename host-sequence free samples
Option --un-conc shows results like samtools options -F 2 (excluding reads "mapped in proper pair").
Paired reads that do not map both to the host sequence might still be included in the "host removed" output.
For better control about read filtering options, see workflow below.
If multi-processor option -p is used, output reads might have a different order compared to input files.
Use option --reorder to keep the original read order.
(read order refers to .sam output but might effect also host-removed read output files .1 .2)
2) Using bowtie2 together with samtools
complex solution that gives better control over the rejected reads by using SAM-flags
How to filter out host reads from paired-end fastq files?
a) bowtie2 mapping against host genome: write all (mapped and unmapped) reads to a single .bam file
b) samtools view: use filter-flags to extract unmapped reads
c) samtools fastq: split paired-end reads into separated R1 and R2 fastq files
a) bowtie2 mapping against host sequence
1) create or download ready to use bowtie2 index databases (host_DB)
# create bowtie2 database using a reference genome (fasta file)
# download ready to use bowtie2 database of human genome GRCh38 (hg38)
# move all files into your working directory (or into your predefined $BOWTIE2_INDEXES location)
# use "GRCh38_noalt_as" for your bowtie2 database name (instead of "host_DB")
2) bowtie2 mapping against host sequence database, keep both aligned and unaligned reads (paired-end reads)
3) convert file .sam to .bam
b) filter required unmapped reads
# SAMtools SAM-flag filter: get unmapped pairs (both reads R1 and R2 unmapped)
-F 256 Do not(-F) extract alignments which are: <not primary alignment>
see meaning of SAM-flags
c) split paired-end reads into separated fastq files .._R1 .._R2
# sort bam file by read name (-n) to have paired reads next to each other (2 parallel threads, each using up to 5G memory)
see other options for → converting .bam to .fastq
Two files of paired-end reads, containing non-host sequences
Removel of ribosomal RNA
To in silico remove rRNA, see software tool → SortMeRNA