for screening, filtering, and extracting rows and columns of text tables

awk examples

print column 3 and 4 of file 'table.txt'

awk '{ print $3, $4 }' table.txt   

file input/output: read from file 'table.txt' and write to file 'newTable.txt'

awk < table.txt '{ print $3, $4 }' > newTable.txt  

to use in a pipe

cat table.txt | awk '{ print $3, $4 }' > newTable.txt

print only columns 3 and 4 when column 3 contains "gene123"

awk '$3=="gene123"{ print $3, $4}' table.txt

print complete line when column 3 contains "gene123"

awk '$3=="gene123"' table.txt

print complete line when column 3 contains NOT "NA" (  != means NOT ),  alternatively use grep, see below

awk '$3!="NA"' table.txt

print lines in which the ratio between columns 4 and 5 is smaller than 0.4

awk '$4/$5<0.4' table.txt

print columns separated by tab

change output field separator (OFS) to tabular "\t", instead of using space by default

additionally, fix also input field separator (-F) to tabular "\t", in case of empty fields 

awk -F"\t" '$3=="gene123"{OFS = "\t"; print $3, $4}' table.txt

switch column 3 and 4

awk '{tmp=$3; $3=$4; $4=tmp}; {print }'

switch only in lines where field 3 is larger than field 4

awk '$3>$4{tmp=$3; $3=$4; $4=tmp;}; {print }'

print columns in any newly defined order

awk '{ print $2, $1, $4, $3 }'

For extracting columns from text files, see also: cut


Combine multiple conditions

AND operator: &&

awk '$3=="gene01" && $4>95 { print $3, $4}' table.txt

OR operator: ||

awk '$3=="gene01" || $3=="gene02" { print $3, $4}' table.txt

Count number of columns in each line

# file separator 'FS' can be replaced by any column separator FS="\t" or FS="|" or  FS=":" ...

awk '{ FS = "\t" } ; { print NF}'  table.txt

Sum and average (mean) of numbers in column 2

awk 'BEGIN{C=0}; {C=C+$2}; END{print C "\t" C/NR}' table.csv

 2345 56

For getting word frequencies, see also: uniq -c


FASTQ files

# Convert FASTQ sequence-IDs into older Illumina sequence name format for paired reads (ending with "/1" or "/2")

# awk: add "/1" , "/2" every 4th line, but only if line starts with "@"

cat SAMPLE_R1.fastq | awk '{ if (NR % 4 == 1 && /^@/) {$0=$0"/1"}; {print} }'

cat SAMPLE_R2.fastq | awk '{ if (NR % 4 == 1 && /^@/) {$0=$0"/2"}; {print} }'

→ FASTQ file format


Remove lines

using sed or grep

# remove all empty lines (d - delete)

sed '/^$/d' textfile.txt > new.txt

# remove all lines containing 'NA'

grep -vw NA textfile.txt > new.txt

# remove all lines that contain multiple strings from the unwanted word list file

grep -Fwf unwanted_word_list.txt -v textfile.txt > new.txt

  -w search for full word, not substring

  -F read pattern as plain text, not interpreting special characters as regular expressions

  -f read pattern from file

  -v keep non-matching lines (remove lines that contain the searched word)