It seems that due to my recent post, Bioinformatics and cloud computing, I have been labeled a cloud skeptic. While I don't reject that label outright, I won't accept it either. If I may label myself, I would call myself a cloud realist. My first piece of evidence is that at the end of my previous post I specifically state, "This is all not to say that there is not a place for cloud and other distributed computing frameworks in bioinformatics, but that's the topic of a future post." Unfortunately, this is not the future post to which that statement refers. The purpose of this post is to respond to some of the comments made on that post and around the web.

First, Ben Langmead said,

My main comment is that you’re comparing the cloud cost against only at one type of cost: the one-time cost of buying new machines and adding them to your (already large; at least at Wash U) pool of computers. That isn’t the only relevant number for a lot of people, especially those in smaller institutions and academic departments, because (a) there are recurring costs for electricity, cooling, space, and (b) there isn’t necessarily a huge pool of computers (and support staff, and space) to begin with, so the initial cost and effort barrier can be much larger than the cost of the machines per se.

Bob Carpenter then adds similar comments,

To repeat what Ben Langmead said above, the total cost of ownership of a computer, even for a university, is much higher than its purchase cost. For instance, how many computers does each sysadmin manage (or how much time does it take to manage new operating system patches, software installs, etc.)? How much space do they take up? The power for these beasts is not inconsiderable… My wife’s having trouble with her cluster at NYU because the building’s heating and cooling are both tied to the same faulty plumbing system; so even though it’s winter here in NYC, when the heat went out, so did the machine room cooling, so they had to shut down all the machines for a day or two. Just like when the AC went out in the summer.

Finally, Shiran Pasternak over at Plant Tech Tonics says,

What his numbers don't take into account is the overhead of running a (possibly single node) cluster. While the fixed cost of purchasing computer equipment might be manageable, especially compared to chemical reagents, the operational costs of running a data center are substantial. Computer equipment needs to be continually serviced, be it for software, security, or kernel patches, or for unscheduled maintenance. In addition, energy costs for running a data center are high and expected to increase in the near future.

Yes, it is true that the cost for the Dell server I quoted was just the purchase price. But the price I quoted for a computing core in our cluster, $500, was a fully loaded cost. As indicated in the post, that fully loaded cost includes server, rack, networking, electrical hookup, installation, 3-year warranty, etc. In other words, that is the cost to add a core to an existing cluster, and it was provided for those researchers who do have clusters (the cost of the Dell was provided for those who do not). It does not include system administration, electrical power, or cooling; that is, it covers capital costs only, not ongoing costs. Why did I not include those ongoing costs? Because I did not need to. To maintain pace with the sequence data generated by an Illumina GA IIx or two, you don't need any of that stuff! Adding a few cores to an existing computing infrastructure is not going to make a substantive difference in its power or cooling load. For a lab without an existing computing cluster, all you need is the desk where you sit your bioinformatician. If you are at a normally operating university, the electrical power and cooling for office space are paid for out of the overhead your university takes from your grants. If you operate a core facility at a university, then you simply work these costs into the fees you charge (their contribution is several orders of magnitude smaller than the cost of the sequencing reagents). What about labs that have lots of sequencers but not a lot of computing power? Well, that's bad planning and allocation of assets; no one can help you.

Systems administration costs are a similar story. For researchers with existing clusters, the addition of a few cores to keep pace with a few Illumina instruments will not require them to hire additional IT staff. For researchers without a cluster, I posit that it does not take more system administration effort to manage a single desktop workstation than it would to manage a cluster of Amazon EC2 nodes. Amazon EC2 provides virtual hardware and a stock installation of an operating system. Aside from the fact that you can purchase computers from Dell with Red Hat Enterprise [GNU/]Linux, any bioinformatician worth her salt (or any 12-year-old for that matter) can install Ubuntu on a computer. Just as the Dell customer will have to install their bioinformatics tools on the system, so too will the Amazon EC2 customer, except that they will need to install them on all the nodes they have rented. Regarding maintaining security patches and other updates, that is also dead simple in Ubuntu (although I will readily admit that just because something is easy, it does not necessarily follow that people will do it). The bottom line is that maintaining a workstation used for day-to-day activities and analyzing data from one or two Illumina instruments is more likely to be within the capabilities of a bioinformatician than setting up and maintaining an Amazon EC2 cluster.

Another point brought up in the above comments was reliability of the systems. One of the arguments in this area is that with your own hardware, you are responsible for maintaining the equipment while with Amazon EC2, they manage all the hardware. This is not really the case, though. All of the costs I have quoted included a 3-year warranty with on-site service. The reliability argument also involves downtime. If your local systems go down, whether for hardware failures, network outages, power outages, or Armageddon, it is true that you will not be able to do any computations on them, but you're also not going to be able to access your EC2 systems and those EC2 systems will not be able to pull data from your systems (and in the case of Armageddon, Amazon EC2 will probably also be down).

So, that leaves us with the question: what would the fully loaded cost of the Dell workstation be, and what is the break-even point with Amazon EC2? The cost of the quad-core system was roughly $1700. You only need one core for data analysis. Since you need to buy your bioinformatician a workstation anyway and it needs an operating system, bioinformatics software, power, and cooling, we'll ignore those costs. So the purchase price becomes the fully loaded cost for comparison purposes. Assuming you would otherwise buy your bioinformatician a dual-core system with 1 GiB of RAM (Firefox uses a lot of memory), which costs about $1000, the incremental cost of getting a machine capable of analyzing data is $700; spread over the two additional cores, the incremental cost per computing core is only $350. That dollar amount will buy you less than three genomes' worth of analysis on Amazon EC2.
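For readers who want to plug in their own quotes, here is a minimal Python sketch of the arithmetic above. The workstation prices and core counts are the ones from this post. The EC2 cost per genome is not restated here, so `HYPOTHETICAL_EC2_COST_PER_GENOME` is a placeholder chosen purely for illustration, and the function names are my own; substitute your own figure before drawing any conclusions.

```python
import math


def incremental_cost(upgraded_price, baseline_price, upgraded_cores, baseline_cores):
    """Return (extra dollars, extra dollars per additional core)."""
    extra_dollars = upgraded_price - baseline_price
    extra_cores = upgraded_cores - baseline_cores
    return extra_dollars, extra_dollars / extra_cores


def break_even_genomes(local_cost, cloud_cost_per_genome):
    """Smallest whole number of genomes at which cloud spend reaches local_cost."""
    return math.ceil(local_cost / cloud_cost_per_genome)


# Figures from the post: $1700 quad-core vs. $1000 dual-core workstation.
extra, per_core = incremental_cost(1700, 1000, upgraded_cores=4, baseline_cores=2)
print(extra, per_core)  # 700 350.0

# ASSUMPTION: placeholder value for illustration only, not a quoted Amazon price.
HYPOTHETICAL_EC2_COST_PER_GENOME = 125.0
print(break_even_genomes(per_core, HYPOTHETICAL_EC2_COST_PER_GENOME))  # 3
```

With that placeholder, the cloud bill reaches the per-core cost of the local machine by the third genome; a lower per-genome price pushes the break-even point out, and a higher one pulls it in.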

Bob Carpenter had a few other points worth addressing: viruses and running the analysis multiple times. I would argue that the former is an issue regardless of where you run your analysis. Plus, for the GNU/Linux systems we are talking about in these scenarios, viruses are much less of an issue than they are for Microsoft Windows. Regarding running the analysis multiple times, sure, it may mean you need more than one core to keep up, but it also means you are going to pay Amazon a lot more too. With the quad-core system quoted above, you have a whole extra core (two for the desktop, one for the single-pass analysis, and one extra) to spill over into at no cost.

Before I close, I would like to thank all the commenters for raising the above points. All of the issues they raised are very important to consider when jumping into the next-generation informatics space. They also made it clear that my previous post was not as thorough as I thought it was when I hit the publish button. In addition to the excellent comments I quoted above, there were also several other good points regarding software in the comments of the previous post that I hope to incorporate in future posts (and hopefully this post will generate a few comments as well).