Search this site
Embedded Files
Metagenomics
  • Metagenomics
    • Taxonomy
      • Alpha and beta diversity
      • Strain level
      • Pangenome
      • Marker genes
        • MLST
      • 16S
        • Operational taxonomic unit (OTU)
        • 16s vs shotgun
    • Phylogeny
      • Horizontal gene transfer (HGT)
      • Lowest Common Ancestor (LCA)
      • Long branch attraction (LBA)
    • Evolution
    • Ecosystem
      • Environmental gene tags (EGTs)
      • Soil metagenomics
    • Metatranscriptomics
      • RNA-seq versus microarray
    • QC
      • Coverage depth
      • Sampling depth
        • Calculate sampling depth
      • RPKM calculation
      • Phred score (Q score)
      • Giga base pairs
      • GC content
    • Wiki
      • Microbiome
      • Orthologs and paralogs
      • Short read mapping
      • Fecal microbiota transplant
      • DNA sequencing library
      • Multiplex sequencing
      • x
      • x
      • x
      • x
        • x
      • x
      • x
      • x
  • Tools
    • 16S tools
      • DADA2
        • Conda environment
          • Segmentation fault
      • Normalization
        • CSS
      • QIIME
        • Alpha & Beta diversity
        • Install QIIME 1
          • AttributeError:axisbg
          • biom error
          • Greengenes
          • h5py BIOM error
          • PackagesNotFoundError
          • qiime_config
          • R for QIIME
          • usearch61
        • OTU biom table
          • QIIME: split OTU table
          • ValueError column index exceeds matrix dimensions
        • QIIME mapfile
        • QIIME OTU clustering
          • Error running usearch61
          • Fungi 18S
          • IOError Errno 28 No space left on device
          • SILVA
        • QIIME pre-processing
          • Cannot find fastq-join
          • join-error
          • split-error
        • Taxonomy
        • Test statistics
    • Assembly
      • Contigs
      • N50 statistics
      • MEGAHIT
    • BLAST
      • BLAST error: Too many positional arguments
      • BLAST word-size
      • BLASTn output format 6
      • BLASTx
      • E-value & Bit-score
      • FastANI
        • Install FastANI
          • Install autoconf
          • Install GLS
      • Generate_database
        • Public sequences
      • Install BLAST
      • megablast
    • Bowtie2
      • Create bowtie2 index
      • Install bowtie2
        • libtbb.so.2
        • Segmentation fault
    • Genome
      • Prokka
    • Pathogen screening
    • Phylogenetic tree
      • File format
      • Sequence-Alignment
        • Alignment viewer
      • Tree-construction
        • FastTree
        • RAxML
          • Add bipartition
          • Install RAxML
      • Tree-viewer
        • Forester
          • Java error
    • SAMtools
      • BCFtools
      • Consensus sequence
      • Converting BAM to fastq
      • Error
        • Could not parse the header line
        • different line length in sequence
      • Install
        • Error curses.h
      • Number of reads in bam file
      • SAM file format
      • SAMtools: get breadth of coverage
    • Sequence data
      • Convert fastq to fasta
      • Extract sequence subset
      • Get random subset
      • Multi-FASTA format
      • NCBI ftp genome download
        • gff to ffn
    • Shotgun sequencing
      • Alignment
        • Viewer
          • Tablet
      • Data
      • NCBI SRA files
        • Install SRA-tools
        • prefetch
        • Error
          • path not found while resolving tree within virtual file system module
        • wget download
        • old fastq-dump
      • Quality control
        • Trim Galore
      • Remove host sequences
      • Remove too short reads from fastq files
      • Fastq file format
    • Ubuntu Linux
      • awk
      • bzip2
      • Extract columns from file
      • File properties
      • gzip & tar.gz
      • Loop over list of files
        • find
        • if greater than
        • String split
      • Rename multiple files
      • rsync
      • sed
      • Shell stderr redirect
      • Show disk space usage
      • Split large file
      • Data tables
        • Transpose table
        • Sorting
    • HPC SGE cluster
      • Submit job
      • Check cluster nodes
Metagenomics

awk

Ubuntu - Command Line Tools

for screening, filtering, and extracting rows and columns of text tables

awk examples

print column 3 and 4 of file 'table.txt'

awk '{ print $3, $4 }' table.txt   

file input/output: read from file 'table.txt' and write to file 'newTable.txt'

awk < table.txt '{ print $3, $4 }' > newTable.txt  

to use in a pipe

cat table.txt | awk '{ print $3, $4 }' > newTable.txt

print only columns 3 and 4 when column 3 contains "gene123"

awk '$3=="gene123"{ print $3, $4}' table.txt

print complete line when column 3 contains "gene123"

awk '$3=="gene123"' table.txt

print complete line when column 3 contains NOT "NA" (  != means NOT ),  alternatively use grep, see below

awk '$3!="NA"' table.txt

print lines in which the ratio between columns 4 and 5 is smaller than 0.4

awk '$4/$5<0.4' table.txt

print columns separated by tab

change output field separator (OFS) to tabular "\t", instead of using space by default

additionally, fix also input field separator (-F) to tabular "\t", in case of empty fields 

awk -F"\t" '$3=="gene123"{OFS = "\t"; print $3, $4}' table.txt

switch column 3 and 4

awk '{tmp=$3; $3=$4; $4=tmp}; {print }'

switch only in lines where field 3 is larger than field 4

awk '$3>$4{tmp=$3; $3=$4; $4=tmp;}; {print }'

print columns in any newly defined order

awk '{ print $2, $1, $4, $3 }'

For extracting columns from text files, see also: cut


Combine multiple conditions

AND operator: &&

awk '$3=="gene01" && $4>95 { print $3, $4}' table.txt

OR operator: ||

awk '$3=="gene01" || $3=="gene02" { print $3, $4}' table.txt

Count number of columns in each line

# file separator 'FS' can be replaced by any column separator FS="\t" or FS="|" or  FS=":" ...

awk '{ FS = "\t" } ; { print NF}'  table.txt

Sum and average (mean) of numbers in column 2

awk 'BEGIN{C=0}; {C=C+$2}; END{print C "\t" C/NR}' table.csv

 2345 56

For getting word frequencies, see also: uniq -c


FASTQ files

# Convert FASTQ sequence-IDs into older Illumina sequence name format for paired reads (ending with "/1" or "/2")

# awk: add "/1" , "/2" every 4th line, but only if line starts with "@"

cat SAMPLE_R1.fastq | awk '{ if (NR % 4 == 1 && /^@/) {$0=$0"/1"}; {print} }'

cat SAMPLE_R2.fastq | awk '{ if (NR % 4 == 1 && /^@/) {$0=$0"/2"}; {print} }'

→ FASTQ file format


Remove lines

using sed or grep

# remove all empty lines (d - delete)

sed '/^$/d' textfile.txt > new.txt

# remove all lines containing 'NA'

grep -vw NA textfile.txt > new.txt

# remove all lines that contain multiple strings from the unwanted word list file

grep -Fwf unwanted_word_list.txt -v textfile.txt > new.txt

  -w search for full word, not substring

  -F read pattern as plain text, not interpreting special characters as regular expressions

  -f read pattern from file

  -v keep non-matching lines (remove lines that contain the searched word)

www.metagenomics.wiki

Author: Matthias Scholz

Google Sites
Report abuse
Page details
Page updated
Google Sites
Report abuse