July 17th, 2009
At the NHGRI Sequencing Advisory Panel meeting last week, there was some confusion about how we use SNP array data and dbSNP in the DNA sequencing world. SNP arrays are, when you boil it down, a quick and cheap way to determine the nucleotide at specific positions in the genome. For each sample sequenced at The Genome Center, we use SNP array data from that same sample to measure the breadth of sequence coverage we have achieved. In short, at certain points in the sequencing of a sample, we run variant calling, which generates, among other things, single nucleotide variants (SNVs, single nucleotides that differ from the standard human reference sequence). Once our variant detection pipeline finds 99% or more of the SNPs found on that sample's SNP array, we are confident that we have good coverage of the genome.
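To make the coverage check concrete, here is a minimal Python sketch. It is an illustration, not our actual pipeline: the function names are made up, variants are represented simply as (chromosome, position) pairs, and a SNP counts as "found" if the pipeline called any variant at that position (genotype matching is ignored). The 99% threshold comes from the description above.

```python
def array_snp_recovery(array_sites, called_sites):
    """Fraction of SNP-array variant sites that also appear in the sequencing calls.

    Both arguments are iterables of (chromosome, position) pairs. Matching is by
    position only, which is a simplifying assumption of this sketch.
    """
    array_sites = set(array_sites)
    if not array_sites:
        raise ValueError("no SNP array sites to compare against")
    return len(array_sites & set(called_sites)) / len(array_sites)


def coverage_is_sufficient(array_sites, called_sites, threshold=0.99):
    """True once the calls recover at least `threshold` of the array SNPs."""
    return array_snp_recovery(array_sites, called_sites) >= threshold


# Toy data: four SNPs on the array, three of them recovered by sequencing.
array_sites = {("chr1", 1000), ("chr1", 2500), ("chr2", 300), ("chr7", 42)}
called_sites = {("chr1", 1000), ("chr1", 2500), ("chr2", 300), ("chr3", 9)}

recovery = array_snp_recovery(array_sites, called_sites)
print(f"{recovery:.0%} of array SNPs recovered")
print("good coverage" if coverage_is_sufficient(array_sites, called_sites) else "keep sequencing")
```

Note that the denominator is the set of array SNPs, not the set of calls: extra calls that are not on the array do not affect this number.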
dbSNP, a catalog of common SNPs, is used in a different way. Any individual is expected to have somewhere around 75-85% of his or her SNVs in common with those in dbSNP. So when we call SNVs in a sample, about 80% of them should appear in dbSNP, i.e., not be private mutations. This makes dbSNP concordance (the percentage of called SNVs that are found in dbSNP) a rough measure of the false positive rate of the detection algorithm. If the rate of dbSNP concordance is much lower than 80%, your results likely have a lot of false positives.
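The dbSNP check can be sketched the same way. Again, this is a hypothetical illustration rather than our real code: the function names are invented, variants are (chromosome, position) pairs, and the 0.75 cutoff is simply the low end of the 75-85% range mentioned above, standing in for "much lower than 80%".

```python
def dbsnp_concordance(called_snvs, dbsnp_sites):
    """Fraction of called SNVs that are already catalogued in dbSNP.

    Both arguments are iterables of (chromosome, position) pairs.
    """
    called_snvs = set(called_snvs)
    if not called_snvs:
        raise ValueError("no SNV calls to evaluate")
    return len(called_snvs & set(dbsnp_sites)) / len(called_snvs)


def likely_many_false_positives(concordance, lower_bound=0.75):
    """Flag call sets whose concordance falls below the expected range.

    The default lower_bound is an assumption of this sketch, not a fixed rule.
    """
    return concordance < lower_bound


# Toy data: ten called SNVs, eight of which are known to dbSNP.
dbsnp_sites = {("chr1", pos) for pos in range(100, 200)}
called_snvs = {("chr1", pos) for pos in range(100, 108)} | {("chr2", 5), ("chr2", 6)}

concordance = dbsnp_concordance(called_snvs, dbsnp_sites)
print(f"dbSNP concordance: {concordance:.0%}")
print("suspicious" if likely_many_false_positives(concordance) else "looks reasonable")
```

Here the denominator is the set of calls, the opposite of the SNP array check: the question is how many of the calls are already known, not how many known sites were recovered.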
So SNP arrays from the samples are used to measure breadth of coverage, and dbSNP is used to measure the accuracy of variant detection. What is the difference between a SNP and an SNV? As used here, an SNV is any single nucleotide in a sample that differs from the reference, and it may be private to that individual, while a SNP is a variant that is shared amongst a population.