PolITiGenomics

Politics, Information Technology, and Genomics

Biology of Genomes Poster

May 13th, 2009

Several people have asked for an electronic copy of my poster, Maximizing Utility of Genome Sequence Data (pdf) (posted on the Internet Archive). As I hope the poster makes clear, in addition to high-throughput sequencing, we now have high-throughput sequence analysis.

After listening to Lynda Chin’s talk on the first evening of the conference, which described the arduous process of translating a single putative cancer driver mutation to its function in the cell, one can’t help but feel we are just kicking the can down the road here. Alleviating one bottleneck just creates another. This was the case with the PC: as CPUs became faster and faster, other components (memory, network, and disk I/O) became the bottlenecks. It has also been the case with high-throughput production sequencing: you buy more sequencers, so you need more disk, then more CPUs to analyze all the data, and then a network upgrade to move all the data around. Now in genomics we can generate lots of data and lots of variants that may play a role in cancer. How will we determine the function of all these variants? What technologies are on the horizon that will enable high-throughput functional genomics?

Posted in genomics, IT | 2 Comments »

2 Responses to “Biology of Genomes Poster”

  1. hOGjOWLSmCgEE Says:

    This is massively overcomplicating the
    problem. Don’t think of it as a flowchart.

    Think of it as pseudocode:

    for each read
       align (result is alignment)
    for each alignment
       if anomaly
           check overlap alignments for same anomaly
    Evaluate anomalies
    for each anomaly
       if in somatic and not wild type
           anomaly is somatic
    

    Much easier to understand.

  2. As they say, “the devil is in the details”. While your pseudo-code may be easier to understand, the flowchart on the poster is a true representation of our pipeline. You must realize that the Illumina sequencing and variant detection pipelines above are but two of the many automated pipelines we have at The Genome Center. We need to make them modular, so that components can be reused, and fault tolerant enough that they can be restarted at any point along the pipeline (like check-pointing). Remember, if you are a single user running data through a processing profile and it dies at any point for any reason, you can stop, troubleshoot, and get it going again. In a high-throughput system where 300 such profiles are running in parallel, you don’t have time to stop, troubleshoot, and restart 40 of them when they fail. It’s a totally different problem.
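
To make the check-pointing idea concrete, here is a minimal sketch in Python. It is an illustration only, not the code behind the pipelines on the poster: `run_pipeline`, the step names, the JSON checkpoint file, and the toy `align` and `call_variants` functions are all hypothetical stand-ins. The point is that each step records its completion, so a profile that dies can be requeued and resumes where it left off instead of starting over.

```python
import json
import tempfile
from pathlib import Path


def run_pipeline(steps, checkpoint_file, initial):
    """Run (name, function) steps in order, resuming after the last completed step."""
    path = Path(checkpoint_file)
    if path.exists():
        state = json.loads(path.read_text())
    else:
        state = {"done": [], "data": initial}
    for name, func in steps:
        if name in state["done"]:
            continue  # completed in an earlier run, so skip it
        state["data"] = func(state["data"])
        state["done"].append(name)
        path.write_text(json.dumps(state))  # checkpoint after every step
    return state["data"]


# Hypothetical steps; the counters only exist to show what gets re-run.
calls = {"align": 0, "call_variants": 0}


def align(reads):
    calls["align"] += 1
    return [r.upper() for r in reads]  # stand-in for a real aligner


def call_variants(alignments):
    calls["call_variants"] += 1
    if calls["call_variants"] == 1:
        raise RuntimeError("simulated node failure")
    return [a for a in alignments if "N" not in a]  # stand-in for a real caller


steps = [("align", align), ("call_variants", call_variants)]
checkpoint = Path(tempfile.mkdtemp()) / "profile_001.json"

try:
    run_pipeline(steps, checkpoint, ["acgt", "acnt"])
except RuntimeError:
    pass  # an automated system would log the failure and requeue the profile

# The second attempt skips "align" and retries only the step that failed.
result = run_pipeline(steps, checkpoint, ["acgt", "acnt"])
```

In this sketch the restart needs no human in the loop, which is what matters when hundreds of profiles run at once. A production version would also need atomic checkpoint writes and per-step outputs too large to hold in a single JSON file.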

    One other thing: your pseudo-code carries a lot of data around rather than reducing the amount of data at each step. With that approach, you need to keep track of all your reads, all your alignments, and all your anomalies, and process through them in a combinatorial manner. That is not going to scale very well. You need to be aggressive about data reduction, keeping ambiguity where necessary but eliminating redundancy where possible. If you don’t, you will be overwhelmed by data storage and computational costs.
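
As a rough illustration of data reduction at each step, the Python sketch below streams observations once and keeps only per-site counts of non-reference alleles, so no read or alignment has to be retained. Everything in it is an assumption made for the example: the `(position, base)` tuples are a toy stand-in for real alignment records, and `summarize`, `somatic_candidates`, and `min_support` are hypothetical names, not part of our pipeline.

```python
from collections import Counter, defaultdict


def summarize(observations, reference):
    """Stream (position, base) pairs and keep only non-reference counts per site."""
    counts = defaultdict(Counter)
    for pos, base in observations:
        if base != reference[pos]:
            counts[pos][base] += 1  # the observation itself is discarded here
    return counts


def somatic_candidates(tumor, normal, min_support=2):
    """Report alleles with enough tumor support that never appear in the normal."""
    return sorted(
        (pos, base)
        for pos, alleles in tumor.items()
        for base, n in alleles.items()
        if n >= min_support and normal.get(pos, Counter())[base] == 0
    )


reference = "ACGTAC"
tumor_obs = [(1, "C"), (1, "T"), (1, "T"), (3, "G"), (3, "G"), (4, "A")]
normal_obs = [(1, "C"), (3, "G"), (3, "T")]

# iter() mimics a stream: each observation is seen once and then dropped.
tumor = summarize(iter(tumor_obs), reference)
normal = summarize(iter(normal_obs), reference)
candidates = somatic_candidates(tumor, normal)
```

Every alternative allele at a site is kept in the counts (the ambiguity), while the many reads supporting the same allele collapse into a single number (the redundancy). Here the G at position 3 is dropped because it also appears in the normal sample, leaving only the T at position 1 as a candidate.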
