Earlier this week there were several meetings about the 1000 Genomes Project at Cold Spring Harbor Labs. The first meeting, Monday morning, was about data flow and data repositories. NCBI's Short Read Archive (SRA) and the equivalent at EBI (which should be ready in a month or two) will house all the data. The pilot projects for the 1000 Genomes Project started less than two months ago and have already generated as much sequence data as half of the entire trace archive (which contains the sequence data from all publicly funded genome projects over the last 10 years). In other words, this project is going to generate a lot of sequence data (not to mention all the data generated by analysis of the sequence). Paul Flicek from EBI estimates the pilot projects alone will generate about 1 PB (1,000,000 GB) of sequence data. Moving that much data from site to site will be a challenge. The usual solutions, e.g., FTP, rsync, and shipping hard drives, can't keep up with the data generation rates. NCBI, EBI, and the sequencing centers are testing a high-speed data transfer solution called Aspera scp. It has impressive transfer rates, but seems to stall after a while for no discernible reason. We'll see if we can get it to work reliably over the coming weeks.
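To get a feel for why moving a petabyte is such a headache, here is a quick back-of-the-envelope calculation. The sustained link speeds below are illustrative assumptions on my part, not measured numbers from NCBI, EBI, or the sequencing centers:

```python
# Rough transfer times for ~1 PB of sequence data at assumed sustained rates.
# These rates are illustrative guesses, not benchmarks from any of the centers.

DATA_BYTES = 1_000_000 * 10**9  # 1 PB expressed as 1,000,000 GB

rates_bits_per_sec = {
    "100 Mbit/s (a well-behaved FTP/rsync over the public internet)": 100e6,
    "1 Gbit/s (dedicated research network link)": 1e9,
    "10 Gbit/s (high-end link of the sort Aspera-style transfers target)": 10e9,
}

for label, rate in rates_bits_per_sec.items():
    seconds = DATA_BYTES * 8 / rate   # bytes -> bits, then divide by rate
    days = seconds / 86_400
    print(f"{label}: about {days:,.0f} days")
```

Even with perfectly sustained throughput, 1 PB is roughly two and a half years at 100 Mbit/s and about three months at 1 Gbit/s, which is why everyone is so interested in whether the high-speed option can be made to run reliably.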

After the data flow meeting was a meeting of the 1000 Genomes Steering Committee. The day and a half that ensued was filled with lively discussion. When all was said and done, one thing was clear: there are a lot of questions that still need to be answered. The analysis group presented convincing simulation results indicating that 2× coverage across a large number of individuals (Pilot 1) is probably not sufficient to detect the rare variants the project is going after (those present in 1-2% of the population). The simulations indicated that the power of the study to detect such variants, at constant cost (i.e., a constant total amount of sequence generated), would be greatly enhanced by sequencing half as many people at 4× coverage. There was no firm decision on how to change the pilot (if at all), but going forward it is likely that some of the individuals in Pilot 1 will be sequenced at up to 4× or even 8×. Thus, while the project may be named 1000 Genomes, exactly how many genomes we will sequence is yet to be determined.
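To see intuitively why low coverage hurts, here is a toy calculation of my own, not the analysis group's simulation. It assumes read depth at a site is Poisson with mean equal to the average coverage, reads are error-free, and a heterozygous variant in a carrier is only "seen" if at least two reads carry the alternate allele; the two-read threshold and the Poisson model are simplifying assumptions for illustration only:

```python
from math import exp, factorial

# Toy model: depth ~ Poisson(coverage); at a heterozygous site, roughly half
# the reads sample the alternate allele, so alt-read count ~ Poisson(coverage/2).
# Call the site "detected" in that carrier if >= min_alt_reads alt reads appear.

def p_het_seen(coverage, min_alt_reads=2):
    """Probability a heterozygous site in one carrier yields >= min_alt_reads alt reads."""
    mean_alt = coverage / 2.0
    p_fewer = sum(exp(-mean_alt) * mean_alt**k / factorial(k)
                  for k in range(min_alt_reads))
    return 1.0 - p_fewer

for cov in (2, 4, 8):
    print(f"{cov}x coverage: P(het seen in a carrier) = {p_het_seen(cov):.2f}")
```

Under this naive model a heterozygous carrier at 2× shows the alternate allele twice only about a quarter of the time, versus roughly 60% at 4× and over 90% at 8×. The real simulations of course model sequencing error, genotype accuracy, and the ability to pool evidence across samples, but the direction of the effect is the same.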

Another issue that arose was the rapid development of the massively parallel sequencing technologies. These platforms (454 FLX, Illumina Genome Analyzer, and AB SOLiD) increase their throughput, improve their data quality, and update their analysis software several times each year. Such dynamic platforms make it very difficult to develop tools to analyze their data, e.g., to align the reads to a reference genome and detect variants. The right platforms and tools today may not be the best next month or next year when the main project gets underway. This brings two major needs to the fore. First, experimental design will not end when the project starts; the experiment will need to be adjusted as capabilities and capacities change. Second, we will not only have to continually develop and refine tools throughout the project, we will also need frameworks to continually test and compare the tools that are available. It's always fun to hit a moving target.
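As a rough sketch of what such a comparison framework might look like, here is a minimal harness that runs a set of pipelines on simulated reads and scores their variant calls against the known truth. The pipeline names, wrapper scripts, and file formats are hypothetical placeholders, not real invocations of any particular aligner or caller:

```python
#!/usr/bin/env python
"""Toy sketch of a recurring tool-comparison harness (hypothetical names and
commands; real aligner/caller invocations would be substituted each round)."""

import csv
import subprocess

# Placeholder pipelines to compare; each wrapper script is assumed to take
# reads, a reference, and an output path, and to write tab-separated calls.
PIPELINES = {
    "pipeline_A": "run_pipeline_A.sh {reads} {reference} {out}",
    "pipeline_B": "run_pipeline_B.sh {reads} {reference} {out}",
}

def load_calls(path):
    """Read variant calls as a set of (chromosome, position, allele) tuples."""
    with open(path) as handle:
        return {(row[0], int(row[1]), row[2])
                for row in csv.reader(handle, delimiter="\t")}

def evaluate(truth_path, reads, reference):
    truth = load_calls(truth_path)
    for name, template in PIPELINES.items():
        out = f"{name}.calls.tsv"
        subprocess.run(template.format(reads=reads, reference=reference, out=out),
                       shell=True, check=True)
        calls = load_calls(out)
        true_pos = len(calls & truth)
        sensitivity = true_pos / len(truth) if truth else 0.0
        fdr = 1 - true_pos / len(calls) if calls else 0.0
        print(f"{name}: sensitivity={sensitivity:.3f}  false discovery rate={fdr:.3f}")

if __name__ == "__main__":
    evaluate("simulated_truth.tsv", "simulated_reads.fastq", "reference.fasta")
```

The point is less the specific metrics than the habit: rerun the same truth-based comparison every time a platform, chemistry, or software version changes, so the choice of tools can track the moving target.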

The meeting also discussed the ethical, legal, and social implications (ELSI) of the project. This discussion largely focused on which populations to sample for the project. Should we deepen our knowledge of individuals of Central European, African, and East Asian ancestry to aid in methodology development? Or should we broaden our knowledge of overall human variation by including fewer individuals from a larger number of populations? To be determined…