A previous cloud post, Puff piece, has gotten a bit of attention from Jason Stowe and Informatics Iron. While the Informatics Iron piece was positive, Mr. Stowe took issue with some of the points I made. First, he says that my claim that IT and software engineering expertise are needed to get things running on the cloud is inaccurate.

You are implying that to get running in the cloud, an end user must worry about the “IT expertise” and “software engineering” needed to get applications up and running. I believe this is a straw-man, an incorrect assertion to begin with.

One of the major benefits of virtualized infrastructure and service oriented architectures is that they are repeatable and decouple the knowledge of building the service from the users consuming it. This means that one person, who creates the virtual machine images or the server code running the service, does need the expertise to get an application running properly in the cloud. But after that engineering is done once, a whole community of end-users of that service can benefit without knowledge of the specifics of getting the application to scale.

For example, does everyone that uses GMail/Yahoo/Hotmail know every line of software code to make it run? Do they know every operational aspect of how to make mail scale to tens of thousands of processors across many data centers?

Definitely not, and the point is they don’t have to. The same is true for high performance and high throughput computing. To give examples of free services that don’t require end user software engineering or IT expertise to do bioinformatics/proteomics/etc.:

  • The NIH Website for BLAST has, for years, been running BLAST as a service so that researchers can use GUIs to run queries on parallel back-end infrastructure (see http://www.ncbi.nlm.nih.gov/genome/seq/BlastGen/BlastGen.cgi?taxid=9606). This requires no complicated knowledge or software engineering for scientists to run BLAST as a Service.
  • Tools like ViPDAC have 2-minute tutorial videos to run proteomics on Amazon Web Service.
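The BLAST-as-a-service point above can be made concrete: NCBI also exposes BLAST through a documented URL API, so a user's entire "software engineering" can amount to constructing one HTTP request. Here is a minimal sketch that only builds the submission URL (it does not contact NCBI); the `CMD=Put` parameters come from NCBI's URL API, while the endpoint constant and the toy sequence are assumptions for illustration:

```python
from urllib.parse import urlencode

# Assumed public endpoint for NCBI's BLAST URL API.
BLAST_URL = "https://blast.ncbi.nlm.nih.gov/Blast.cgi"

def build_blast_submit_url(sequence, program="blastn", database="nt"):
    """Build (but do not send) a BLAST submission request.

    In NCBI's URL API, CMD=Put submits a query; the service returns a
    request ID (RID) that is later polled with CMD=Get. None of the
    back-end cluster details are visible to the user.
    """
    params = {
        "CMD": "Put",
        "PROGRAM": program,
        "DATABASE": database,
        "QUERY": sequence,
    }
    return BLAST_URL + "?" + urlencode(params)

# Hypothetical toy sequence, for illustration only.
url = build_blast_submit_url("ACGTACGTACGTACGT")
print(url)
```

The design point is the one Mr. Stowe makes: all the engineering needed to make BLAST scale lives behind that URL, done once by NCBI, and every consumer of the service benefits without touching it.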

His argument is absolutely correct when dealing with established systems, applications, and work flows. For use cases like email and running BLAST, there is no need for additional software engineering or IT expertise (other than getting on the internet). In fact, The Genome Center has long offered a BLAST service for anyone to use. Further, over the past few weeks, several prepackaged bioinformatics work flows that run on the cloud (or some approximation thereof) have been announced: CycleCloud for Life Sciences from Mr. Stowe's company Cycle Computing, GenomeQuest SDM, Cloud Bio-Linux from Bio-Team, ChIP-seq and RNA-seq analysis pipelines from DNAnexus, the work flows available in Galaxy, and of course the previously published Crossbow.

Unfortunately, canned analyses are not the norm in bioinformatics. Bioinformaticians love to tinker, trying to get just a little more biological information out of their data sets. The result is that bioinformatics applications and work flows are constantly being tweaked, updated, and improved. Because of this, maintaining these pipelines is a huge burden. The supporters of these generic pipelines must work constantly to update and verify software, or the users will constantly be waiting for the latest fix to be applied or the latest feature to be available (anyone who installs each new version of velvet can attest to this).

The saving grace in all of this is that as the use of sequencing becomes more widespread, the percentage of people doing the analysis who are bioinformaticians will decrease greatly. This means that a larger and larger share of people with sequence data to analyze will likely not be interested in tweaking analysis pipelines; they will just want to run something and get an answer. It is this ever-growing group of people that will benefit most from easy-to-use analysis tools, whether they are deployed on the cloud or not.

Both Mr. Stowe and I agree that creating easy-to-use tools for non-bioinformaticians is a very worthwhile goal. Unfortunately, the proliferation of existing tool options (e.g., maq, bwa, bowtie, bfast, soap, novoalign, etc.), now layered with a proliferation of cloud offerings, will make it even more difficult for non-experts to choose which pipeline is best to use. Therefore, approaches like those taken by Cycle Computing and GenomeQuest, which provide default analysis pipelines along with the ability for bioinformaticians to create and share their own work flows, are the most likely to be successful. The development of these generic, distributed analysis frameworks that also provide useful defaults is an even more worthwhile goal because it achieves two important ends: ease of use for non-experts and the ability for bioinformaticians to tinker. Bioinformaticians are more likely to find tools like these useful and therefore will be early adopters, choose the best platforms, establish best practices on those platforms, and publish results using them; the non-experts will then follow.

Mr. Stowe's other objection related to my point that no process scales linearly with the number of cores. He concedes that point but points out:

In fact, regardless of whether the job is linearly scalable, most companies and research institutions don’t have 1 cluster to 1 user scenarios. There are multiple users with multiple jobs each. What if you have 10 crossbow users with 10 runs to do on various genomes? Then you can get 100x performance on the *workflow as a whole*.

Again, this is true, but, to be fair, that is not the same point he made in his original article. His original point was that if you needed your analysis to run faster you could just provision more nodes. I simply pointed out that while this is true, you would likely pay a premium for it because nothing scales linearly. It may seem like a fine distinction, but with all the misinformation around clouds nowadays, it's an important one to make. It should also be noted that without good software engineering and system administration, even algorithms that should scale nearly linearly might not. The take-home message is that if someone has done the software engineering and systems administration work to make a program scale well and run well in a cloud environment and made it available to you, great. If not, someone is going to have to do it.
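The "nothing scales linearly" point is just Amdahl's law: if only a fraction of a single job parallelizes, each additional node buys less than the last. A small illustrative calculation (the 95%-parallel figure is a made-up example, not a measurement of any real pipeline):

```python
def amdahl_speedup(parallel_fraction, n_cores):
    """Amdahl's law: best-case speedup for a job whose parallel_fraction
    parallelizes perfectly and whose remainder stays serial."""
    serial = 1.0 - parallel_fraction
    return 1.0 / (serial + parallel_fraction / n_cores)

# Even a job that is 95% parallel tops out below 20x, no matter
# how many nodes you provision (and pay for):
for n in (10, 100, 1000):
    print(n, "cores ->", round(amdahl_speedup(0.95, n), 1), "x")
```

This is why per-job provisioning carries a premium, and why Mr. Stowe's 100x figure applies to the workflow as a whole: ten users' ten independent runs are embarrassingly parallel across jobs even when no single job scales well.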

I had the opportunity to meet Mr. Stowe at the XGen Congress and have talked more with him this week at Bio-IT World Conference and Expo (my talk is tomorrow at 11 a.m. EDT in Track 3: Bioinformatics and Next-Gen Data). We had a good discussion about cloud computing and its role in bioinformatics (they've got a cool solution to the Amazon storage problem). As you can hopefully tell from this post, we are largely in agreement: engineering is needed, but once it is done, everyone benefits. Cycle Computing certainly has a lot of good expertise in the cloud, so if you need some engineering done, shoot him an email. Unfortunately, they probably will not be able to help you access the largest cloud computing service.