This week The Genome Center will begin sequencing our first samples from the 1000 Genomes Project. This project, which aims to create a much deeper understanding of normal human variation, will begin by sequencing DNA from the well-characterized samples of the HapMap Project. The initial sequencing plan consists of three separate pilot projects, each of which will generate an unprecedented amount of genomic information.

The first pilot involves sequencing a large number of individuals, about 200, to relatively low coverage: 2×, or about 6 Gb per sample. This study will probe the efficacy of light sequence coverage of a large number of individuals. The resulting data will help to guide the development of methods for imputation of haplotypes from incomplete sequence data. Imputation, in other words, is the ability to predict variation at one site based on other variations detected in the same region of the genome. The data for this pilot project will largely be generated on the Illumina (Solexa) platform.
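To make the idea of imputation concrete, here is a toy sketch in Python. It is not the method the project will use; real imputation methods are statistical and considerably more involved. The reference panel, the `?` marker for unobserved sites, and the function name `impute` are all made up for illustration.

```python
# Hypothetical reference panel: each string is one known haplotype,
# one character per variant site (0 = reference allele, 1 = alternate allele).
REFERENCE_HAPLOTYPES = ["00110", "01011", "11100"]


def impute(observed, panel=REFERENCE_HAPLOTYPES):
    """Fill '?' sites in `observed` from the best-matching panel haplotype."""

    def matches(hap):
        # Count agreeing sites, ignoring sites that were not observed.
        return sum(o == h for o, h in zip(observed, hap) if o != "?")

    best = max(panel, key=matches)
    # Keep every observed allele; borrow the rest from the best match.
    return "".join(h if o == "?" else o for o, h in zip(observed, best))


print(impute("0?0?1"))  # prints 01011
```

With light coverage, many sites in a sample go unobserved. The sketch fills them in from the known haplotype that agrees best with the sites that were seen.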

The second pilot project involves the sequencing of two sets of trios (mother-father-child) to high, 20-fold or 20×, coverage. 20× coverage means that for each individual's three-gigabase (Gb) genome, we will generate about 60 Gb of data. The purpose of this pilot project is twofold. First, by sequencing parents and their offspring, we will attain a better understanding of heritability and genomic mutation rates. Second, by sequencing these subjects to high redundancy, we will be able to gauge the depth of sequence necessary in further studies of human variation. The data for this pilot project will largely be generated on the Illumina (Solexa) platform.
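The coverage arithmetic used in these descriptions can be sketched in a few lines of Python. This assumes a genome size of roughly 3 Gb, as stated above, and treats coverage as a simple multiplier. The function name `sequence_yield_gb` is made up for illustration, and the project-wide totals are back-of-the-envelope figures derived from the numbers in this post, not official targets.

```python
GENOME_SIZE_GB = 3.0  # approximate human genome size in gigabases (assumed)


def sequence_yield_gb(coverage, genome_size_gb=GENOME_SIZE_GB):
    """Return the raw sequence (in Gb) needed to reach a given fold coverage."""
    return coverage * genome_size_gb


low = sequence_yield_gb(2)    # first pilot, per sample
high = sequence_yield_gb(20)  # second pilot, per individual

print(f"2x coverage:  {low:.0f} Gb per sample")
print(f"20x coverage: {high:.0f} Gb per individual")
print(f"Two trios (6 individuals) at 20x: {6 * high:.0f} Gb in total")
print(f"About 200 samples at 2x: {200 * low:.0f} Gb in total")
```

By this rough estimate, the roughly 200 lightly covered samples of the first pilot add up to more raw sequence than the six deeply covered individuals of the second.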

The third pilot project will sequence the exomes of over 1000 individuals to 20× coverage. The exome is the collection of all exons in the genome. An exon is a part of a gene whose sequence is kept in the mature messenger RNA, which is in turn translated to make protein. To sequence just the exome, the genome is fragmented and various techniques are used to selectively capture the exonic regions. At present, these selection techniques require a significant amount of input DNA and are therefore usable only on samples with ample DNA. This project aims to get a detailed picture of the genomic variation that directly affects proteins and, therefore, the work of life. The advantage of this approach is its focus on protein changes, which can readily be interpreted in terms of how they affect the operation of the cell. The disadvantage is that it yields no information on the majority of the genome, which does not encode proteins and whose role in the functioning of the cell is poorly understood. Data for this pilot will largely be generated on the 454 platform.

Together, these three pilot projects will not only generate mountains of useful information on human variation but also inform the future direction of the 1000 Genomes Project.