N50 statistics

Genome assembly

N50 is a measure of the quality of assembled genomes that are fragmented into contigs of different lengths.

N50 is the shortest contig length that needs to be included to cover 50% of the genome when contigs are added from longest to shortest.

Meaning

-> At least half of the genome sequence is covered by contigs larger than or equal to the N50 contig size.

-> The lengths of all contigs of size N50 or longer sum to at least 50 percent of the total genome sequence.

N50 is not simply the median of all contig lengths. It is a length-weighted median, which gives a more robust quality value than a simple median; see the explanation by Keith Bradnam:

http://www.acgt.me/blog/2013/7/8/why-is-n50-used-as-an-assembly-metric.html
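The difference between the two medians can be illustrated with a small sketch. The helper name `length_weighted_median` and the contig lengths are made up for illustration. The idea is to count every contig once per base it contains, so that long contigs weigh more. This is only practical for toy data, not for real genomes.

```python
from statistics import median, median_high


def length_weighted_median(lengths):
    """Median over bases instead of contigs (illustrative helper, toy data only)."""
    # Repeat each contig length once per base, so a 5000 bp contig counts 5000 times.
    per_base = [length for length in lengths for _ in range(length)]
    # median_high takes the larger middle value, matching the "at least 50%" rule.
    return median_high(per_base)


contigs = [5000, 4000, 2000, 1000]      # hypothetical assembly, lengths in bp
with_debris = contigs + [100] * 10      # same assembly plus ten tiny contigs

print(median(contigs), length_weighted_median(contigs))
print(median(with_debris), length_weighted_median(with_debris))
```

In this toy case the ten tiny contigs pull the simple median down to the size of the debris, while the length-weighted value stays at the same contig.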

How to calculate the N50

Intuitively, to get the N50 contig length, sort all contigs of a genome by length, go to the base at the center (at 50% of the total genome length), and take the size of the contig to which this base belongs. That size is the N50 contig length.
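The procedure above can be sketched in a few lines of Python. This is a minimal sketch, not a reference implementation: the function name `n50` and the sample lengths are assumptions, and contigs are sorted from longest to shortest.

```python
def n50(contig_lengths):
    """Return the N50 of a list of contig lengths (illustrative helper)."""
    if not contig_lengths:
        raise ValueError("need at least one contig")
    total = sum(contig_lengths)
    running = 0
    # Walk from the longest contig to the shortest, accumulating length.
    for length in sorted(contig_lengths, reverse=True):
        running += length
        # Compare 2 * running with total to stay in exact integer arithmetic.
        if 2 * running >= total:
            return length


# Made-up contig lengths: total = 290, half = 145, and 80 + 70 = 150 >= 145.
print(n50([80, 70, 50, 40, 30, 20]))
```

The input order does not matter, because the function sorts the lengths itself.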

Example

For an assembly fragmented into contigs with lengths of 5, 4, 2, and 1 kb (total length = 12 kb), half of the genome length (6 kb) is not reached by the 5 kb contig alone, but it is covered by the two largest contigs together (5 + 4 = 9 kb). N50 = 4 kb is therefore the minimum contig length required to cover 50 percent of the assembled genome sequence.

N10 is the minimum contig length to cover 10 percent of the genome (5 kb in the example above).

N90 is the minimum contig length to cover 90 percent of the genome (2 kb in the example above).
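The same idea extends to any percentage. The following is a rough sketch under the definitions given here; the function name `nx` and its `percent` parameter are hypothetical, and `percent` is assumed to be an integer.

```python
def nx(contig_lengths, percent):
    """Return the minimum contig length needed to cover `percent` of the assembly."""
    if not contig_lengths:
        raise ValueError("need at least one contig")
    if not 0 < percent <= 100:
        raise ValueError("percent must be in the range (0, 100]")
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        # Integer comparison, equivalent to running / total >= percent / 100.
        if 100 * running >= percent * total:
            return length


contigs_kb = [5, 4, 2, 1]  # the example assembly from this page, in kb
for p in (10, 50, 90):
    print(f"N{p} = {nx(contigs_kb, p)} kb")
```

For the example assembly this gives 5 kb, 4 kb, and 2 kb for N10, N50, and N90.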


Read more

http://seqanswers.com/forums/showthread.php?t=2332

https://www.broad.harvard.edu/crd/wiki/index.php/N50

https://en.wikipedia.org/wiki/N50,_L50,_and_related_statistics