Biology of Genomes Poster
May 13th, 2009
Several people have asked for an electronic copy of my poster, Maximizing Utility of Genome Sequence Data (pdf, posted on the Internet Archive). As is hopefully clear from the poster, in addition to high-throughput sequencing, we now have high-throughput sequence analysis. After listening to Lynda Chin’s talk on the first evening of the conference, which described the arduous process of tracing a single putative cancer driver mutation to its function in the cell, one can’t help but feel we are just kicking the can down the road here.
The alleviation of one bottleneck just creates another. This was the case with the PC: as CPUs became faster and faster, other components, e.g., memory, network, and disk I/O, became the bottlenecks. It has also been the case with high-throughput production sequencing. You buy more sequencers, so you need more disk, then you need more CPUs to analyze all the data, and then you need to upgrade your network to move all the data around. Now in genomics, we are able to generate lots of data and lots of variants that may play a role in cancer. How will we determine the function of all these variants? What technologies are on the horizon that will enable high-throughput functional genomics?
Posted in genomics, IT
Tagged with: CSHL, genomics, informatics, IT, science, software, wustl

May 16th, 2009 at 3:38 pm
This is massively overcomplicating the problem. Don’t think of it as a flowchart. Think of it as pseudocode:

    for each read
        align (result is alignment)
    for each alignment
        if anomaly
            check overlapping alignments for same anomaly
    evaluate anomalies
    for each anomaly
        if in somatic and not wild type
            anomaly is somatic

Much easier to understand.
May 18th, 2009 at 9:58 am
As they say, “the devil is in the details”. While your pseudo-code may be easier to understand, the flow chart on the poster is a true representation of our pipeline. You must realize that the above Illumina sequencing and variant detection pipelines are but two of many automated pipelines we have at The Genome Center. We need to make them modular, so that components can be reused, and highly fault tolerant, so that they can be restarted at any point along the pipeline (much like check-pointing). Remember, if you are a single user running data through a processing profile and it dies at any point for any reason, you can stop, troubleshoot, and get it going again. In a high-throughput system where 300 such profiles are running in parallel, you don’t have time to stop, troubleshoot, and restart 40 of them when they fail. It’s a totally different problem.
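To illustrate the check-pointing idea, here is a minimal Python sketch. It is for illustration only and is not the actual pipeline code: the names run_pipeline, state_path, align, and detect are hypothetical, and it assumes each step's output is small enough to store as JSON. Each completed step is recorded on disk, so a restart skips whatever already finished.

```python
import json
import os
import tempfile


def run_pipeline(steps, state_path, data):
    """Run named steps in order, skipping any already recorded as done.

    steps: list of (name, function) pairs; each function takes and returns
    a JSON-serializable value. state_path: file holding the checkpoint.
    """
    # Resume from the checkpoint if an earlier run left one behind.
    if os.path.exists(state_path):
        with open(state_path) as fh:
            state = json.load(fh)
    else:
        state = {"done": [], "data": data}

    for name, func in steps:
        if name in state["done"]:
            continue  # finished in an earlier run; do not repeat it
        state["data"] = func(state["data"])
        state["done"].append(name)
        # Write to a temporary file and rename it, so a crash during the
        # write never leaves a half-written checkpoint behind.
        tmp_path = state_path + ".tmp"
        with open(tmp_path, "w") as fh:
            json.dump(state, fh)
        os.replace(tmp_path, state_path)
    return state["data"]


# Demo: the second step fails on its first attempt, then succeeds on restart.
calls = []
attempts = {"detect": 0}


def align(data):
    calls.append("align")
    return data + ["aligned"]


def detect(data):
    attempts["detect"] += 1
    if attempts["detect"] == 1:
        raise RuntimeError("simulated failure")
    calls.append("detect")
    return data + ["variants"]


steps = [("align", align), ("detect", detect)]
state_path = os.path.join(tempfile.mkdtemp(), "checkpoint.json")

try:
    run_pipeline(steps, state_path, [])
except RuntimeError:
    pass  # the first run dies after "align" has been checkpointed

result = run_pipeline(steps, state_path, [])  # restart: "align" is skipped
print(result)
```

In the demo, align runs only once even though the pipeline is started twice. With hundreds of profiles running in parallel, a restart like this could be automated rather than handled by hand.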
One other thing: your pseudo-code carries a lot of data around rather than reducing the amount of data at each step. You need to keep track of all your reads, all your alignments, and all your anomalies, and process through them in a combinatorial manner. That is not going to scale very well. You need to be aggressive about data reduction, keeping ambiguity where necessary but eliminating redundancy whenever possible. If you don’t, you will be overwhelmed by data storage and computational costs.
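As a rough sketch of what reducing data at each step can look like, the Python below collapses a stream of alignments into per-variant support counts, so the individual reads never need to be held in memory. Everything here is an assumption made for illustration: the (position, ref_base, read_base) tuple format, the reduce_to_candidates name, and the min_support threshold are hypothetical and not taken from our pipeline.

```python
from collections import Counter


def reduce_to_candidates(alignments, min_support=2):
    """Collapse a stream of alignments into per-variant support counts.

    alignments: iterable of (position, ref_base, read_base) tuples
    (a hypothetical, simplified format). Only mismatches are kept, and
    identical mismatches are merged into a single count.
    """
    support = Counter()
    for position, ref_base, read_base in alignments:
        if read_base != ref_base:  # keep only the anomalies
            support[(position, ref_base, read_base)] += 1
    # Drop weakly supported candidates. Different alternate bases at the
    # same position stay as separate entries, which preserves ambiguity.
    return {variant: n for variant, n in support.items() if n >= min_support}


def stream_alignments():
    """Yield alignments one at a time, as an aligner might emit them."""
    yield (100, "A", "A")
    yield (100, "A", "G")
    yield (100, "A", "G")
    yield (205, "C", "T")
    yield (100, "A", "A")


candidates = reduce_to_candidates(stream_alignments())
print(candidates)
```

Five alignments go in and one candidate with its support count comes out. Memory use grows with the number of distinct anomalies rather than with the number of reads.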