PolITiGenomics

Politics, Information Technology, and Genomics

Gathering cloud at XGen

March 10th, 2010 dd Posted in IT, genomics No Comments »

If you are going to be at XGen next week and you are interested in cloud computing and its application to bioinformatics, be sure to stop and participate in the Cloud Computing in Bioinformatics discussion I will be “facilitating” on Wednesday morning (March 17). My talk is at 3:05 p.m. PT on Tuesday and I will be chairing the first session on Monday (if my plane is on time and the taxi is fast enough).

New data center approved

March 10th, 2010 dd Posted in IT, genomics 1 Comment »

The Genome Center recently received word that its grant proposal for a data center was approved (St. Louis Business Journal). The $14.3 million grant is funded by the National Center for Research Resources and the money comes from ARRA. The grant, along with about $8 million from Washington University, will allow us to essentially duplicate our current data center capacity. We took possession of our current data center in May 2008 and it is already 80-90% full, so this new data center will greatly help us keep pace with all of the exciting new projects we are undertaking.

Me, in podcast form

February 24th, 2010 dd Posted in IT, genomics No Comments »

I recently did an interview in advance of my talk at the XGen Congress next month in San Diego. The interview is about 14 minutes and discusses our work at The Genome Center in general and more specifically the software and IT infrastructure we have created to enable the analysis of the massive amounts of sequence data we generate. The interview is available to download as part of the XGen Congress podcast series.

The Pac’s out of the bag

February 23rd, 2010 dd Posted in genomics No Comments »

Most of you have probably already seen this, but Pacific Biosciences announced the institutions that will be getting their first ten prototype instruments (Bio-IT World, GenomeWeb, MarketWatch). The Genome Center is among the institutions that will be getting one. It looks like PacBio will indeed be the first third generation sequencing company with instruments out in the wild. Don’t get too excited, though: it’s probable that these third generation instruments will be a lot like the first batch of second generation instruments, in that it will take a while before they are ready for production sequencing and reliably producing good quality data. We’ll find out more from all the sequencing instrument companies in the coming days at AGBT.

Next-Generation Sequencing Informatics Update

February 19th, 2010 dd Posted in IT, genomics 6 Comments »

I updated the Next-Generation Sequencing Informatics table a few weeks ago but forgot to mention it on the blog. The main update was the 50G configuration of the Illumina GA IIx. Also, the Sides & Associates blog linked to my table and referred to it as a “somewhat dated comparison of next-generation sequencing platforms.” Just to clarify, this table represents average throughput for production systems; not vendor claims about throughput, not future vaporware (and Alejandro Gutierrez corrected his description in the post once I pointed this out). As new systems come online and further improvements are made to existing platforms, the table will be updated.

Puff piece

February 16th, 2010 dd Posted in IT, genomics 1 Comment »

Why should one be skeptical of all the information touting the wonders of cloud computing? This older, in-depth piece by Gartner, Hype Cycle for Cloud Computing, 2009, lays out the reasons pretty well. But one need not spend that much time reading about it. You can simply read this much shorter piece by Jason Stowe: Is the Future Of High-Performance Computing For Life Sciences Cloudy? Reading that story, one can only get the impression that the cloud is some panacea where all computational problems are solved. In fact, the picture is so rosy that one may become suspicious. So suspicious that one may read the About the Author section at the bottom of the piece and see that Mr. Stowe happens to be CEO of a company selling cloud computing services.

Jason Stowe is the founder and CEO of Cycle Computing, a provider of high-performance computing (HPC) and open source technology in the cloud. A seasoned entrepreneur and experienced technologist, Jason attended Carnegie Mellon and Cornell Universities.

No wonder he makes cloud computing sound so attractive. No mention of the IT expertise needed to get up and running on the cloud. No mention of the software engineering needed to ensure your programs run efficiently on the cloud. It may not be apparent from his article, but a program that runs well on one or ten computers does not necessarily run well on hundreds of computers. In fact, he implies the exact opposite.

For compute clusters as a service, the math is different: Having 40 processors work for 100 hours costs the same as having 1,000 processors run for 4 hours.

It may cost the same under that scenario, but not everything scales linearly. In fact, most things don’t, and that less-than-linear scaling actually ends up making it cost more to get a shorter turnaround. This fact was clearly evident in the Crossbow paper, where it cost $52 to complete the analysis in 6.5 hours but $84 to finish it in under 3 hours (Table 4). The article fails to mention this, which is remarkable given that the lack of good, scalable bioinformatics tools that can run well in highly parallel environments is perhaps the largest impediment to the adoption of cloud computing in bioinformatics. Of course, I am sure he will gladly sell you consulting services that will get you up and running on the cloud. In short, this looks like a shill.
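To make the scaling point concrete, here is a rough sketch of the arithmetic. It assumes a simple Amdahl's-law model, in which some fraction of the work cannot be parallelized, and a flat price per processor-hour. Neither the model, the 0.1% serial fraction, nor the $0.10 rate comes from the article or the Crossbow paper; they are illustrative assumptions only.

```python
def wall_clock_hours(cpu_hours, processors, serial_fraction=0.0):
    """Wall-clock time under Amdahl's law (an assumed model, not a measurement).

    The serial part of the work runs on one processor no matter how many
    you rent; only the remaining part is divided among the processors.
    """
    serial = cpu_hours * serial_fraction
    parallel = cpu_hours * (1.0 - serial_fraction)
    return serial + parallel / processors


def cluster_cost(cpu_hours, processors, rate, serial_fraction=0.0):
    """You are billed for every processor for the whole wall-clock time."""
    hours = wall_clock_hours(cpu_hours, processors, serial_fraction)
    return processors * hours * rate


CPU_HOURS = 40 * 100   # the quoted example: 40 processors for 100 hours
RATE = 0.10            # hypothetical price per processor-hour, in dollars

if __name__ == "__main__":
    for serial_fraction in (0.0, 0.001):
        for processors in (40, 1000):
            hours = wall_clock_hours(CPU_HOURS, processors, serial_fraction)
            dollars = cluster_cost(CPU_HOURS, processors, RATE, serial_fraction)
            print(f"serial={serial_fraction:.3f} n={processors:4d} "
                  f"hours={hours:7.2f} cost=${dollars:7.2f}")
```

With a serial fraction of zero, the two configurations cost the same, which is the scenario in the quote. With even 0.1% of the work serial, the 1,000-processor run takes about 8 hours rather than 4 and the bill is nearly double that of the 40-processor run. For comparison, the Crossbow figures above ($84 versus $52) are a premium of roughly 62% for the faster turnaround.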

Unfortunately, omitting information is not the only problem with many of the stories about cloud computing; many also contain misinformation. For example, the story Gathering clouds and a sequencing storm in Nature Biotechnology mentions the software engineering challenges but erroneously states

…bioinformaticians might not be willing to spend the time to familiarize themselves with hadoop, the open source program needed to process large data sets on a cloud

What?!? You do not have to develop tools using Hadoop. Sure, it is a nice platform that provides fault-tolerant parallelism, but it is by no means required by any cloud provider that I know of (not even Google, whose MapReduce framework provided the model for Hadoop!), nor is it the only way to achieve parallel processing (far from it). Amazon EC2 just provides you with a virtual machine with a basic operating system installed on it and remote access. You can do whatever you want with it after that. Google and Microsoft do require that you develop your code in their cloud framework, but you do not have to use Hadoop. For information on what you do have to do to run jobs on the major cloud providers, check out this article by Udayan Banerjee, Cloud Economics — Amazon, Microsoft, Google Compared, and each provider’s web site: Amazon AWS, Google App Engine, and Microsoft Windows Azure.

(How many bad cloud puns can I work into post titles? Stay tuned.)

Seq-o-matic ‘76

February 3rd, 2010 dd Posted in genomics No Comments »

Bass-o-matic

Soon after Illumina announced its HiSeq 2000, it also announced the GA IIx’s little brother, the GA IIe. The IIe will produce about half as much data as the IIx, but no one seems to know exactly how this is done. The unit is cheaper than the IIx, $250,000 for the IIe compared to $400,000 (I think) for the IIx, but is upgradeable to the IIx. So perhaps the optics system is cheaper. But the run time is the same, so it seems like the optics would need to be about the same (the older optics system was slower). The IIe seems to use the same kits as the GA IIx. That seems odd to me because the consumables cost is typically the largest part of the per run cost. So while you will save on instrument depreciation costs per run, those savings disappear when considering cost per Gb. Another way to look at it is that if reagent costs are indeed the same, it makes no sense to buy two GA IIe instruments. You would be much better off buying one GA IIx. It is only if your lab has a sequencing workload that cannot utilize a GA IIx full time that a GA IIe makes economic sense.
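Here is a back-of-the-envelope version of that argument. The instrument prices are the ones quoted above (with the IIx price being my guess); the number of runs over the instrument's life, the reagent cost per run, and the output per run are hypothetical placeholders chosen only to show the shape of the calculation.

```python
def cost_per_gb(instrument_price, runs_over_lifetime, reagents_per_run, gb_per_run):
    """Per-Gb cost: instrument depreciation per run plus reagents, over output."""
    depreciation_per_run = instrument_price / runs_over_lifetime
    return (depreciation_per_run + reagents_per_run) / gb_per_run


# Prices quoted in the post; everything else below is a hypothetical placeholder.
IIX_PRICE = 400_000
IIE_PRICE = 250_000
RUNS_OVER_LIFETIME = 100      # hypothetical number of runs before full depreciation
REAGENTS_PER_RUN = 10_000     # hypothetical, assumed identical for both (same kits)
IIX_GB_PER_RUN = 50           # hypothetical IIx output per run
IIE_GB_PER_RUN = IIX_GB_PER_RUN / 2   # the IIe produces about half as much data

iix = cost_per_gb(IIX_PRICE, RUNS_OVER_LIFETIME, REAGENTS_PER_RUN, IIX_GB_PER_RUN)
iie = cost_per_gb(IIE_PRICE, RUNS_OVER_LIFETIME, REAGENTS_PER_RUN, IIE_GB_PER_RUN)

if __name__ == "__main__":
    print(f"GA IIx: ${iix:.2f} per Gb")
    print(f"GA IIe: ${iie:.2f} per Gb")
```

The conclusion does not depend on the placeholder values. If the IIe makes half the data with the same kits and run time, matching one IIx takes two IIe instruments ($500,000 rather than $400,000) and twice the reagents, so the IIe comes out more expensive per Gb under these assumptions.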

Life finds a way

January 29th, 2010 dd Posted in genomics 7 Comments »

SOLiD 4

Earlier this week Life Technologies announced the next revision of their SOLiD platform, SOLiD 4. I don’t have all the details that I had for the Illumina HiSeq 2000, but here is what I do know: the system will produce 100 Gb of alignable sequence data on two slides per 14-day run. The sequence data will be paired-end, 50×35 base reads. Reagent costs for each run will be about $6,000. Since you need about 100 Gb of sequence to sequence a human genome, you’re looking at about $6,000 in reagent costs per human genome. They also indicated that capacity for the instrument will increase to 300 Gb per run and the cost for reagents per human genome will be less than $3,000 by the end of 2010. In comparison, the Illumina HiSeq 2000 reagent costs will be about $10,000 per human genome at its release with, by my calculations, a path to about $4,000 per human genome (I have no idea what the time frame might be to reach the end of that path, but given this announcement by Life, it will likely be aggressive). You have to love the way competition drives down costs. Much as Illumina announced a big HiSeq 2000 purchase alongside its launch, Life announced that Ignite Institute would acquire 100 SOLiD 4 instruments as part of a partnership with Life. Life also announced a major bioinformatics investment program as well as a physician education program through their Foundation.
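The per-genome arithmetic above is simple enough to write down. This sketch uses only the figures quoted in this post (about 100 Gb per genome, 100 Gb and about $6,000 of reagents per run now, 300 Gb per run and under $3,000 per genome by the end of 2010). The function names are mine, and the implied per-run reagent ceiling is my own inference, not something Life announced.

```python
GB_PER_GENOME = 100  # approximate sequence needed per human genome, per the post


def reagent_cost_per_genome(reagents_per_run, gb_per_run, gb_per_genome=GB_PER_GENOME):
    """Reagent dollars per genome, assuming cost scales with the Gb consumed."""
    return reagents_per_run * gb_per_genome / gb_per_run


def max_reagents_per_run(target_cost_per_genome, gb_per_run, gb_per_genome=GB_PER_GENOME):
    """Largest per-run reagent cost that still meets a per-genome target."""
    return target_cost_per_genome * gb_per_run / gb_per_genome


if __name__ == "__main__":
    # SOLiD 4 at launch: one 100 Gb run per genome.
    print(reagent_cost_per_genome(6_000, 100))
    # End-of-2010 target: 300 Gb per run and under $3,000 per genome.
    print(max_reagents_per_run(3_000, 300))
```

In other words, if a 300 Gb run covers about three genomes, the sub-$3,000 target implies a reagent cost per run somewhere under $9,000.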

Update: According to the press release, Ignite is “acquiring”, not purchasing, the instruments in “partnership” with Life. So it appears this is not an outright purchase of a large number of instruments. I have updated the text in the post to be more accurate.

Data Center in St. Louis Commerce Magazine

January 28th, 2010 dd Posted in IT, genomics 1 Comment »

There is a story about regional data centers in the Jan/Feb 2010 issue of St. Louis Commerce Magazine that includes a section on our Genome Data Center, the only regional data center to achieve LEED certification (and gold at that!). Unfortunately, the issue seems to be available only as part of a Flash application, so I cannot link to the story, only to the issue, and tell you that the data center story starts on page 62 and the Genome Data Center section is on page 64 (it includes pictures!). This issue of the magazine also includes stories on cloud computing and Washington University in St. Louis Chancellor Mark Wrighton (and high-speed rail, of course).

NCI on PCGP

January 28th, 2010 dd Posted in genomics No Comments »

The National Cancer Institute (NCI) posted a couple of stories that discuss, directly and indirectly, the Pediatric Cancer Genome Project. The first, St. Jude, Washington University Launch Genome Project for Childhood Cancers, is, obviously, about the project. The second is A Conversation about Sequencing Cancer Genomes with Dr. Elaine Mardis.
