Puff piece
February 16th, 2010
Why should one be skeptical of all the information touting the wonders of cloud computing? This older, in-depth piece by Gartner, Hype Cycle for Cloud Computing, 2009, lays out the reasons pretty well. But one need not spend that much time reading about it. You can simply read this much shorter piece by Jason Stowe: Is the Future Of High-Performance Computing For Life Sciences Cloudy? Reading that story, one can only get the impression that the cloud is some panacea where all computational problems are solved. In fact, the picture is so rosy that one may become suspicious. So suspicious that one may read the About the Author section at the bottom of the piece and see that Mr. Stowe happens to be CEO of a company selling cloud computing services.
Jason Stowe is the founder and CEO of Cycle Computing, a provider of high-performance computing (HPC) and open source technology in the cloud. A seasoned entrepreneur and experienced technologist, Jason attended Carnegie Mellon and Cornell Universities.
No wonder he makes cloud computing sound so attractive. No mention of the IT expertise needed to get up and running on the cloud. No mention of the software engineering needed to ensure your programs run efficiently on the cloud. It may not be apparent from his article, but a program that runs well on one or ten computers does not necessarily run well on hundreds of computers. In fact, he implies the exact opposite.
For compute clusters as a service, the math is different: Having 40 processors work for 100 hours costs the same as having 1,000 processors run for 4 hours.
It may cost the same under that scenario, but not everything scales linearly. In fact, most things don’t, and that less-than-linear scaling ends up making a shorter turnaround cost more. This was clearly evident in the Crossbow paper, where it cost $52 to complete the analysis in 6.5 hours but $84 to finish it in under 3 hours (Table 4). The article fails to mention this, which is a marvel given that the lack of good, scalable bioinformatics tools that run well in highly parallel environments is perhaps the largest impediment to the adoption of cloud computing in bioinformatics. Of course, I am sure he will gladly sell you consulting services that will get you up and running on the cloud. In short, this looks like a shill.
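To make the arithmetic concrete, here is a minimal sketch of how a small serial portion changes the bill. It assumes Amdahl's law as the scaling model, which is a simplification and not how the Crossbow figures were derived. The function names, the 4,000 processor-hour workload, the 0.1% serial fraction, and the $0.10 per processor-hour price are all hypothetical values chosen for illustration.

```python
def amdahl_speedup(n_procs, serial_fraction):
    """Speedup over a single processor predicted by Amdahl's law."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / n_procs)


def run_cost(n_procs, single_proc_hours, serial_fraction, price_per_proc_hour):
    """Return (wall-clock hours, total cost) for one run on n_procs processors."""
    wall_hours = single_proc_hours / amdahl_speedup(n_procs, serial_fraction)
    # You pay for every processor for the whole wall-clock time,
    # including the time most of them sit idle during the serial part.
    cost = wall_hours * n_procs * price_per_proc_hour
    return wall_hours, cost


if __name__ == "__main__":
    WORK_HOURS = 4000.0  # hypothetical workload: 40 processors x 100 hours
    PRICE = 0.10         # hypothetical price per processor-hour, in dollars
    for serial_fraction in (0.0, 0.001):
        for n_procs in (40, 1000):
            hours, cost = run_cost(n_procs, WORK_HOURS, serial_fraction, PRICE)
            print(f"serial={serial_fraction:.3f} procs={n_procs:4d} "
                  f"hours={hours:7.1f} cost=${cost:8.2f}")
```

Under this model the bill is the single-processor cost multiplied by 1 + s(n - 1), where s is the serial fraction and n the processor count. With s = 0 the 40-processor and 1,000-processor runs cost the same, as the quote claims. With s = 0.001 the 1,000-processor run costs roughly twice the single-processor figure.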
Unfortunately, omitting information is not the only problem with many of the stories about cloud computing; many also contain misinformation. For example, the story Gathering clouds and a sequencing storm in Nature Biotechnology mentions the software engineering challenges but erroneously states:
…bioinformaticians might not be willing to spend the time to familiarize themselves with hadoop, the open source program needed to process large data sets on a cloud
What?!? You do not have to develop tools using Hadoop. Sure, it is a nice platform that provides fault-tolerant parallelism, but it is by no means required by any cloud provider that I know of (not even Google, whose MapReduce framework provided the model for Hadoop!), nor is it the only way to achieve parallel processing (far from it). Amazon EC2 just provides you with a virtual machine with a basic operating system installed on it and remote access. You can do whatever you want with it after that. Google and Microsoft do require that you develop your code in their cloud framework, but you do not have to use Hadoop. For information on what you do have to do to run jobs on the major cloud providers, check out this article by Udayan Banerjee, Cloud Economics — Amazon, Microsoft, Google Compared, and each provider’s web site: Amazon AWS, Google App Engine, and Microsoft Windows Azure.
(How many bad cloud puns can I work into post titles? Stay tuned.)
February 17th, 2010 at 3:47 pm
David,
Good to meet you. I’ve read your blog before, and take issue with your arguments regarding the cloud. I’ve posted at my blog http://blog.cyclecomputing.com which has more information, as well as the post below. This is an interesting area, and I’d love to correspond with you more at the e-mail at the bottom:
> No wonder he makes cloud computing sound so attractive. No mention of the
> IT expertise needed to get up and running on the cloud. No mention of the
> software engineering needed to ensure your programs run efficiently on
> the cloud.
You are implying that to get running in the cloud, an end user must worry about the “IT expertise” and “software engineering” needed to get applications up and running. I believe this is a straw-man, an incorrect assertion to begin with.
One of the major benefits of virtualized infrastructure and service oriented architectures is that they are repeatable and decouple the knowledge of building the service from the users consuming it. This means that one person, who creates the virtual machine images or the server code running the service, does need the expertise to get an application running properly in the cloud. But after that engineering is done once, a whole community of end-users of that service can benefit without knowledge of the specifics of getting the application to scale.
For example, does everyone that uses GMail/Yahoo/Hotmail know every line of software code to make it run? Do they know every operational aspect of how to make mail scale to tens of thousands of processors across many data centers?
Definitely not, and the point is they don’t have to. The same is true for high performance and high throughput computing. To give examples of free services that don’t require end user software engineering or IT expertise to do bioinformatics/proteomics/etc.:
- The NIH website for BLAST has, for years, been running BLAST as a service so that researchers can use GUIs to run queries on parallel back-end infrastructure (see http://www.ncbi.nlm.nih.gov/genome/seq/BlastGen/BlastGen.cgi?taxid=9606). Scientists need no complicated knowledge or software engineering to run BLAST as a service.
- Tools like ViPDAC have 2-minute tutorial videos to run proteomics on Amazon Web Service.
Lastly, because I recognize these benefits, and think they are tremendously valuable as a way to let organizations focus on what they are good at and outsource other things like IT and software engineering, I have translated my passion for this area into a business that I’ve built over the past 5 years. But that doesn’t make the point any less valid.
> It may not be apparent from his article, but a program that
> runs well on one or ten computers does not necessarily run well on
> hundreds of computers. In fact, he implies the exact opposite.
> For compute clusters as a service, the math is different: Having 40
> processors work for 100 hours costs the same as having 1,000
> processors run for 4 hours.
>
> It may cost the same under that scenario, but not everything scales
> linearly. In fact, most things don’t and that less-than-linear scaling
> actually ends up making it cost more to get a shorter turnaround.
This is also a straw-man, and a deceptive one because it does contain a kernel of truth. It is true that not all applications scale perfectly linearly to arbitrary sizes. So for applications like genomic sequencing (Crossbow), or MPI apps like computational fluid dynamics, where there are various serial pieces to the computation, including overhead if nothing else, you don’t get 1000x the performance for 1000x the processors. But there are many other applications where it is possible to achieve near-linear scaling, what Condor’s Miron Livny calls “pleasantly parallel” problems: Monte Carlo molecular dynamics, BLAST searching with thousands of queries, risk analysis, proteomics runs with different analysis settings, etc.
In fact, regardless of whether the job is linearly scalable, most companies and research institutions don’t have one-cluster-to-one-user scenarios. There are multiple users with multiple jobs each. What if you have 10 Crossbow users with 10 runs to do on various genomes? Then you can get 100x performance on the *workflow as a whole*.
And this is the problem most life science companies/research organizations face when they have multiple users. If 10 people are all submitting 10 jobs to an internal cluster, generally speaking, work can get done far faster with more resources, because the separate jobs from separate users can run alongside each other. Hence, if you’re in the common case of having many users on your cluster/grid, then on the cloud the math is different: odds are 25x the processors will run near 25x faster.
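The many-independent-jobs argument can be sketched with a simple throughput model. This is a minimal illustration, not a description of any real scheduler: it assumes every job is independent, takes the same time, and uses a fixed number of processors, and the function name and job counts are hypothetical.

```python
import math


def makespan_hours(n_jobs, hours_per_job, procs_per_job, total_procs):
    """Wall-clock hours to finish n_jobs independent, equal-length jobs."""
    slots = total_procs // procs_per_job  # jobs that can run side by side
    if slots < 1:
        raise ValueError("not enough processors to run even one job")
    waves = math.ceil(n_jobs / slots)  # rounds of jobs needed to drain the queue
    return waves * hours_per_job


if __name__ == "__main__":
    # Hypothetical workloads: 4-hour, single-processor jobs.
    for n_jobs in (1000, 100):
        small = makespan_hours(n_jobs, 4.0, 1, 40)
        large = makespan_hours(n_jobs, 4.0, 1, 1000)
        print(f"jobs={n_jobs:4d} 40 procs: {small:6.1f} h  "
              f"1000 procs: {large:6.1f} h  speedup: {small / large:.1f}x")
```

In this model the 25x figure holds when there are at least as many queued jobs as processors. With fewer jobs than processors, the extra processors sit idle and the speedup is capped by the number of jobs.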
This is an interesting area, and I’d love to correspond with you more off-line; my e-mail is js at Cycle Computing.