PolITiGenomics

Politics, Information Technology, and Genomics

N Genomes


May 9th, 2008

Earlier this week there were several meetings about the 1000 Genomes Project at Cold Spring Harbor Laboratory. The first meeting, on Monday morning, was about data flow and data repositories. NCBI’s Short Read Archive (SRA) and the equivalent at EBI (which should be ready in a month or two) will house all the data. The pilot projects for the 1000 Genomes Project started less than two months ago and have already generated half as much sequence data as the entire trace archive holds (the archive contains the sequence data for all publicly funded genome projects over the last 10 years). In other words, this project is going to generate a lot of sequence data (not to mention all the data generated by analysis of the sequence). Paul Flicek from EBI estimates the pilot projects alone will generate about 1 PB (1,000,000 GB) of sequence data. Moving that much data from site to site will be a challenge. The usual solutions, e.g., FTP, rsync, and shipping hard drives, cannot keep up with the data generation rates. NCBI, EBI, and the sequencing centers are testing a high-speed data transfer solution called Aspera scp. It has impressive transfer rates, but seems to stall after a while for no discernible reason. We’ll see if we can get it to work reliably over the coming weeks.
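To put the 1,000,000 GB estimate in perspective, here is a rough sketch of how long a single copy of the pilot data would take to move at a few sustained network rates. The rates below are illustrative assumptions, not measurements from any of the centers, and the calculation ignores protocol overhead and disk throughput.

```python
# Back-of-the-envelope transfer times for the pilot data set.
# Data size comes from the estimate quoted above: 1,000,000 GB (decimal units).
PILOT_DATA_BYTES = 1_000_000 * 10**9

# Assumed sustained throughputs in megabits per second.
# These are hypothetical values chosen for illustration only.
ASSUMED_RATES_MBPS = {
    "100 Mbit/s": 100,
    "1 Gbit/s": 1_000,
    "10 Gbit/s": 10_000,
}


def transfer_days(num_bytes, rate_mbps):
    """Return the days needed to move num_bytes at a sustained rate in Mbit/s."""
    bits = num_bytes * 8
    seconds = bits / (rate_mbps * 10**6)
    return seconds / 86_400  # seconds per day


for label, rate in ASSUMED_RATES_MBPS.items():
    days = transfer_days(PILOT_DATA_BYTES, rate)
    print(f"{label:>10}: {days:8.1f} days")
```

Under these assumptions, even a fully saturated 1 Gbit/s link needs roughly three months for one copy, which is why sustained throughput matters so much here.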

The data flow meeting was followed by a meeting of the 1000 Genomes Steering Committee. The day and a half that ensued was filled with lively discussion. When all was said and done, one thing was clear: there are a lot of questions that still need to be answered. The analysis group presented convincing results from simulations indicating that 2× coverage in a large number of individual genomes (Pilot 1) is probably not sufficient to detect the rare variants the project is going after (those present in 1–2% of the population). The simulations indicated that the power of the study to detect such variants (at a constant cost, i.e., a constant total amount of sequence generated) would be greatly enhanced by sequencing half as many people at 4× coverage. There was no firm decision on how to change the pilot (if at all), but going forward it is likely that some of the individuals in Pilot 1 will be sequenced up to 4× or even 8×. Thus, while the project may be named 1000 Genomes, exactly how many genomes we are going to sequence is yet to be determined.
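The trade-off between depth and sample size can be illustrated with a toy calculation. This is not the analysis group's simulation. It is a minimal sketch that assumes reads covering one allele copy follow a Poisson distribution with mean equal to half the coverage, that a variant counts as detected only if at least two reads support it within a single individual, and that allele copies are independent. The sample sizes (400 and 200 people) and the two-read threshold are hypothetical choices.

```python
import math


def p_copy_detected(coverage, min_reads=2):
    """P(at least min_reads reads cover one variant allele copy).

    Assumes reads on a single allele copy ~ Poisson(coverage / 2).
    """
    lam = coverage / 2.0
    p_fewer = sum(
        math.exp(-lam) * lam**k / math.factorial(k) for k in range(min_reads)
    )
    return 1.0 - p_fewer


def detection_power(n_people, coverage, allele_freq, min_reads=2):
    """P(the variant is detected in at least one of the 2 * n_people allele copies)."""
    p_hit = allele_freq * p_copy_detected(coverage, min_reads)
    return 1.0 - (1.0 - p_hit) ** (2 * n_people)


# Same total amount of sequence: 400 people at 2x versus 200 people at 4x.
for freq in (0.01, 0.02):
    shallow = detection_power(400, 2, freq)
    deep = detection_power(200, 4, freq)
    print(f"allele freq {freq:.0%}: 2x power = {shallow:.3f}, 4x power = {deep:.3f}")
```

In this toy model the deeper design comes out ahead because doubling the coverage more than doubles the chance of seeing two supporting reads on a given allele copy. A caller that pools evidence across individuals would behave differently, so the size of the gap here should not be read as the project's result.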

Another issue that arose was the rapid development of the massively parallel sequencing technologies. These platforms (454 FLX, Illumina Genome Analyzer, and AB SOLiD) increase their throughput, improve their data quality, and update their analysis software several times each year. Such dynamic platforms make it very difficult to develop tools that analyze their data, e.g., tools that align reads to a reference genome and detect variants. The right platforms and tools today may not be the best next month or next year when the main project gets underway. This brings two major needs to the fore. First, experimental design will not end when the project starts. The experiment will need to be adjusted as capabilities and capacities change. Second, we will not only have to continually develop and refine tools throughout the project, we will also need to develop frameworks to continually test and compare the tools that are available. It’s always fun to hit a moving target.
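One simple form such a comparison framework could take is scoring each tool's variant calls against a shared truth set and ranking the tools by the result. The sketch below is only an illustration of that idea: the tool names, the variants, and the choice of F1 as the ranking metric are all hypothetical, and nothing here reflects an actual 1000 Genomes evaluation pipeline.

```python
from typing import Dict, List, Set, Tuple

# A variant is identified by (chromosome, position, reference allele, alternate allele).
Variant = Tuple[str, int, str, str]


def score_calls(truth: Set[Variant], calls: Set[Variant]) -> Dict[str, float]:
    """Compute precision, recall, and F1 of a call set against a truth set."""
    true_positives = len(truth & calls)
    precision = true_positives / len(calls) if calls else 0.0
    recall = true_positives / len(truth) if truth else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


def compare_tools(
    truth: Set[Variant], call_sets: Dict[str, Set[Variant]]
) -> List[Tuple[str, Dict[str, float]]]:
    """Score every tool and return the results sorted by F1, best first."""
    scored = [(name, score_calls(truth, calls)) for name, calls in call_sets.items()]
    return sorted(scored, key=lambda item: item[1]["f1"], reverse=True)


# Hypothetical truth set and call sets, for illustration only.
truth_set = {("chr1", 100, "A", "G"), ("chr1", 200, "C", "T"), ("chr2", 50, "G", "A")}
tool_calls = {
    "caller_a": {("chr1", 100, "A", "G"), ("chr1", 200, "C", "T")},
    "caller_b": {("chr1", 100, "A", "G"), ("chr2", 99, "T", "C")},
}

for name, scores in compare_tools(truth_set, tool_calls):
    print(name, {metric: round(value, 3) for metric, value in scores.items()})
```

Rerunning the same scoring whenever a platform or tool is updated would give a consistent basis for deciding when to switch.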

The meeting also discussed the ethical, legal, and social implications (ELSI) of the project. This discussion largely focused on which populations to sample for the project. Should we deepen our knowledge of individuals of Central European, African, and East Asian ancestry to aid in methodology development? Or should we broaden our knowledge of overall human variation by including fewer individuals from a larger number of populations? To be determined…

Posted in genomics | 6 Comments »




6 Responses to “N Genomes”

  1. Hi David,

    Nice to read this blog discussing the 1000 Genomes Project. It seems that the NCBI SRA does not house the corresponding 2× sequences as of now. I am also wondering when these data will become publicly available. Also, how about the sequencing quality? Have these sequences been mapped to the reference genome yet?

    Many thanks in advance.

    Yong Zhang
    University of Chicago

  2. The first data from both Pilot 1 (2×) and 2 (trios) will be available from both NCBI and EBI in FASTQ format. You can get some of the raw data from the NCBI SRA tracking page. Quality can mean a lot of things. Thus far, the data quality seems typical of the platforms they are being generated on. Each data generating center is undoubtedly mapping reads back to the human reference, but these alignments are not available currently. I believe the plan is for the 1000 Genomes DCC to provide mapping files once a standard alignment file format for massively parallel sequencing data is settled upon.

  3. You can get the FASTQ sequence from the first two data freezes here: ftp://ftp-trace.ncbi.nih.gov/1000genomes/

  4. David,

    I went to a talk at ASHG by David Altshuler stating that the project was developing formats to transfer intermediary data types between the sites (i.e., formats for analyzed data, raw data, and other steps in the process). Have any of those formats been released or published? When will they be published, and will you be including the community in this process?

    Thanks,

    David Sexton
    Center for Human Genetics Research
    Vanderbilt University

  5. To the best of my knowledge, none of the formats have been published. In truth, I am trying to track down those formats as well. There are currently two: Sequence Alignment/Map (SAM) format and Genotype Likelihood Format (GLF). I’ll post more information as I find it.

  6. Michelle Munson Says:

    Hello all,

    On the use of Aspera Scp, the stalling behavior described is a result of artificially induced heavy packet loss for the FASP protocol, usually due to setting a target transfer rate that significantly exceeds the throughput to the storage system on the receiver side. The other cause is bandwidth shaping/artificial dropping of UDP traffic along the transmission path.

    The Aspera transfer logs (routed to syslog on Unix systems) have detailed statistics that we can interpret for you which will indicate the root cause.

    Assuming that the receiver side I/O throughput is overdriven, you can verify this for yourselves by running a 3rd party disk benchmarking utility such as bonnie++. Use bonnie++ to measure the write throughput for blocks of 64K and 1 MB (Aspera software uses a configurable block size, 64K by default).

    Once you know the disk throughput bottleneck, you can either set a target rate that does not exceed it or, better yet, as of our 2.2 release (available as of April 2009) you can turn on the storage rate control option, which will automatically adapt the transmission rate to the storage throughput. This is much like network congestion control extended to the storage systems (a patent-pending innovation by our company).

    If you have any questions or problems with the above, we would be glad to help over here at Aspera. You can reach us at support@asperasoft.com or email me directly, michelle@asperasoft.com.

    Thank you,
    Michelle Munson
    President, Aspera, Inc.
