PolITiGenomics

Politics, Information Technology, and Genomics

Gathering cloud at XGen

March 10th, 2010. Posted in IT, genomics. No Comments.

If you are going to be at XGen next week and you are interested in cloud computing and its application to bioinformatics, be sure to stop and participate in the Cloud Computing in Bioinformatics discussion I will be “facilitating” on Wednesday morning (March 17). My talk is at 3:05 p.m. PT on Tuesday and I will be chairing the first session on Monday (if my plane is on time and the taxi is fast enough).


New data center approved

March 10th, 2010. Posted in IT, genomics. No Comments.

The Genome Center recently received word that its grant proposal for a data center was approved (St. Louis Business Journal). The $14.3 million grant is funded by the National Center for Research Resources, and the money comes from ARRA. The grant, along with about $8 million from Washington University, will allow us to essentially duplicate our current data center capacity. We took possession of our current data center in May 2008 and it is already 80-90% full, so this new data center will greatly help us to keep pace with all of the exciting, new projects we are undertaking.


Me, in podcast form

February 24th, 2010. Posted in IT, genomics. No Comments.

I recently did an interview in advance of my talk at the XGen Congress next month in San Diego. The interview is about 14 minutes and discusses our work at The Genome Center in general and more specifically the software and IT infrastructure we have created to enable the analysis of the massive amounts of sequence data we generate. The interview is available to download as part of the XGen Congress podcast series.


Next-Generation Sequencing Informatics Update

February 19th, 2010. Posted in IT, genomics. 6 Comments.

I updated the Next-Generation Sequencing Informatics table a few weeks ago but forgot to mention it on the blog. The main update was the 50G configuration of the Illumina GA IIx. Also, the Sides & Associates blog linked to my table and referred to it as a “somewhat dated comparison of next-generation sequencing platforms.” Just to clarify, this table represents average throughput for production systems; not vendor claims about throughput, not future vaporware (and Alejandro Gutierrez corrected his description in the post once I pointed this out). As new systems come online and further improvements are made to existing platforms, the table will be updated.


Puff piece

February 16th, 2010. Posted in IT, genomics. 1 Comment.

Why should one be skeptical of all the information touting the wonders of cloud computing? This older, in-depth piece by Gartner, Hype Cycle for Cloud Computing, 2009, lays out the reasons pretty well. But one need not spend that much time reading about it. You can simply read this much shorter piece by Jason Stowe: Is the Future Of High-Performance Computing For Life Sciences Cloudy? Reading that story, one can only get the impression that the cloud is some panacea where all computational problems are solved. In fact, the picture is so rosy that one may become suspicious. So suspicious that one may read the About the Author section at the bottom of the piece and see that Mr. Stowe happens to be CEO of a company selling cloud computing services.

Jason Stowe is the founder and CEO of Cycle Computing, a provider of high-performance computing (HPC) and open source technology in the cloud. A seasoned entrepreneur and experienced technologist, Jason attended Carnegie Mellon and Cornell Universities.

No wonder he makes cloud computing sound so attractive. No mention of the IT expertise needed to get up and running on the cloud. No mention of the software engineering needed to ensure your programs run efficiently on the cloud. It may not be apparent from his article, but a program that runs well on one or ten computers does not necessarily run well on hundreds of computers. In fact, he implies the exact opposite.

For compute clusters as a service, the math is different: Having 40 processors work for 100 hours costs the same as having 1,000 processors run for 4 hours.

It may cost the same under that scenario, but not everything scales linearly. In fact, most things don’t, and that less-than-linear scaling actually ends up making it cost more to get a shorter turnaround. This fact was clearly evident in the Crossbow paper, where it cost $52 to complete the analysis in 6.5 hours but $84 to finish it in under 3 hours (Table 4). The article fails to mention this, a marvel given that the lack of good, scalable bioinformatics tools that can run well in highly parallel environments is perhaps the largest impediment to the adoption of cloud computing in bioinformatics. Of course, I am sure he will gladly sell you consulting services that will get you up and running on the cloud. In short, this looks like a shill.
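
To make the scaling point concrete, here is a rough sketch using Amdahl's law, which is one common way to model less-than-linear scaling (it is not the method used in the Crossbow paper). The 5% serial fraction and the $0.10 per processor-hour price are made-up illustration values, not figures from the paper or from any provider's price list.

```python
def amdahl_speedup(processors, serial_fraction):
    """Speedup over one processor when a fixed fraction of the work cannot be parallelized."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / processors)


def run_cost(single_cpu_hours, processors, serial_fraction, price_per_cpu_hour):
    """Return (wall-clock hours, dollars) for a job billed per processor-hour."""
    wall_hours = single_cpu_hours / amdahl_speedup(processors, serial_fraction)
    return wall_hours, wall_hours * processors * price_per_cpu_hour


WORK = 4000.0   # single-processor hours, i.e. 40 processors x 100 hours
PRICE = 0.10    # hypothetical dollars per processor-hour

for serial in (0.0, 0.05):  # 0.0 is the perfectly linear case from the quote
    for n in (40, 1000):
        hours, dollars = run_cost(WORK, n, serial, PRICE)
        print(f"serial={serial:.2f} processors={n:4d} wall={hours:7.1f} h cost=${dollars:,.0f}")
```

With a serial fraction of zero the two configurations cost the same, as the quote claims. With any serial fraction above zero, the 1,000-processor run finishes sooner but costs more, which is the pattern in the Crossbow numbers quoted above: roughly 1.6 times the cost ($84 versus $52) for a little more than twice the speed.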

Unfortunately, omitting information is not the only problem with many of the stories about cloud computing; many also contain misinformation. For example, the story Gathering clouds and a sequencing storm in Nature Biotechnology mentions the software engineering challenges but erroneously states

…bioinformaticians might not be willing to spend the time to familiarize themselves with hadoop, the open source program needed to process large data sets on a cloud

What?!? You do not have to develop tools using Hadoop. Sure, it is a nice platform that provides fault-tolerant parallelism, but it is by no means required by any cloud provider that I know of (not even Google, whose MapReduce framework provided the model for Hadoop!) nor is it the only way to achieve parallel processing (far from it). Amazon EC2 just provides you with a virtual machine with a basic operating system installed on it and remote access. You can do whatever you want with it after that. Google and Microsoft do require that you develop your code in their cloud framework, but you do not have to use Hadoop. For information on what you do have to do to run jobs on the major cloud providers, check out this article by Udayan Banerjee, Cloud Economics — Amazon, Microsoft, Google Compared, and each provider’s web site: Amazon AWS, Google App Engine, and Microsoft Windows Azure.
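
As a small illustration that parallelism does not require Hadoop, here is a sketch that spreads a trivial per-read calculation over a worker pool using only the Python standard library, the sort of thing you could run on a single rented virtual machine. The reads and the function names are invented for the example, and it uses threads to keep the sketch short; for CPU-bound work in CPython you would likely swap in concurrent.futures.ProcessPoolExecutor, which offers the same map interface.

```python
from concurrent.futures import ThreadPoolExecutor


def gc_count(read):
    """Count the G and C bases in one read."""
    return sum(1 for base in read.upper() if base in "GC")


def total_gc(reads, workers=4):
    """Spread the per-read work over a pool of workers and add up the results."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(gc_count, reads))


reads = ["ACGT", "GGCC", "ATAT", "gcgc"]  # toy reads, made up for illustration
print(total_gc(reads))
```

What this does not give you is what Hadoop is actually good at: fault tolerance and distribution across many machines. It only shows that those are a choice, not a requirement.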

(How many bad cloud puns can I work into post titles? Stay tuned.)


Data Center in St. Louis Commerce Magazine

January 28th, 2010. Posted in IT, genomics. 1 Comment.

There is a story about regional data centers in the Jan/Feb 2010 issue of St. Louis Commerce Magazine that includes a section on our Genome Data Center, the only regional data center to achieve LEED certification (and gold at that!). Unfortunately, the issue seems to be available only as part of a Flash application, so I cannot link to the story, only to the issue, and tell you that the data center story starts on page 62 and the Genome Data Center section is on page 64 (it includes pictures!). This issue of the magazine also includes stories on cloud computing and Washington University in St. Louis Chancellor Mark Wrighton (and high-speed rail of course).


Cloudy with a chance of sunshine

January 25th, 2010. Posted in IT. No Comments.

As stated in previous posts (Bioinformatics and cloud computing and Head in the clouds), I don’t think that cloud computing wins the cost competition with local resources. However, there are several reasons why an organization should consider cloud computing. Several of the reasons I present below are discussed in a great interview with Russ Daniels of HP at ars technica, Into the cloud: a conversation with Russ Daniels, Part I and Part II. If you are at all curious about cloud computing, it is well worth reading. (You may also be interested in the ScienceCloud 2010 Workshop.)

Peaks and valleys

The ability to dynamically provision computing resources is integral to the concept of clouds. Dynamic provisioning is often used by online retailers to account for variability in consumer buying. The retailer may have 20 servers that it maintains year round to service average purchasing but also dynamically add servers in the cloud to account for peaks in purchasing, e.g., around the Christmas holiday. In bioinformatics, there are often computational crunches before papers get submitted or before meetings or when a mistake in an algorithm is found and a large number of calculations needs to be redone (Miron Livny of Condor and Open Science Grid calls these “oopses”). Another type of dynamic provisioning involves varying levels of certain hardware architectures or operating systems as needed by current computational demand. For example, one application may require x86 and Ubuntu 8.04 LTS while another may require amd64/em64t/x86_64 and Ubuntu 9.10. If the utilization of each of these programs is cyclical, you can provision the exact system you want when it is needed. This can be done using something like Amazon EC2 or an internal cloud. Thus, dynamic provisioning allows IT departments to design their solutions for steady-state operations but still meet computational needs during peaks.
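
Here is a toy sketch of that steady-state-plus-peaks idea. The monthly demand figures are invented (the spike at the end stands in for a holiday rush or a pre-deadline crunch); only the baseline of 20 servers comes from the retailer example above.

```python
def burst_servers(demand, baseline):
    """Servers to rent from a cloud in each period: whatever demand exceeds the owned baseline."""
    return [max(0, needed - baseline) for needed in demand]


# Hypothetical monthly demand, in servers.
demand = [18, 20, 19, 22, 20, 21, 19, 20, 23, 25, 40, 55]
baseline = 20  # servers owned and run year round

rented = burst_servers(demand, baseline)
print("rented per month:", rented)
print("servers to own without a cloud:", max(demand))
print("servers to own with a cloud:   ", baseline)
```

Sizing for the peak would mean owning 55 servers that sit mostly idle for ten months; sizing for steady state means owning 20 and renting the rest only when needed.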

Space, the final frontier

At universities all over the world there is a constant battle for space. Researchers are always seeking more and administrators are always miserly about allocating it. If your computing needs expand beyond your ability to house, power, and cool them, cloud computing offers a solution. While it may not be cheaper than if the space, power, and cooling were available and paid for out of your grant overhead, it will almost certainly be cheaper than buying your own land and building your own data center. Of course, what people traditionally think of as cloud computing, e.g., Amazon EC2, is not the only option here. There are colocation facilities and scientific computing resources, e.g., NCSA and Open Science Grid. The latter are normally acquired through a granting process.

Persistence pays off

Cloud computing is also very attractive because of its persistence. If I have my computing and storage in the cloud, I can access it from anywhere. When the power goes out at my office, I can use my phone to access the data. When my computer crashes, the computation is still running on the cloud. When my disk fails, my data is still in the cloud. Of course, the cloud does fail at times too. Amazon promises 99.9% uptime, or nearly 9 hours of downtime per year. Of course, if the cloud resources are pulling data from your site (something that may take more time than the computation with current solutions), when your systems go down, you’re still out of luck.
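
The downtime figure is simple arithmetic on the uptime percentage, sketched here for anyone who wants to plug in a different promise. The function name is just for illustration.

```python
HOURS_PER_YEAR = 365 * 24  # 8760, ignoring leap years


def downtime_hours(uptime_percent):
    """Hours per year a service may be down while still meeting an uptime promise."""
    return HOURS_PER_YEAR * (1 - uptime_percent / 100)


print(f"{downtime_hours(99.9):.2f} hours/year")  # 8.76 hours/year
```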


Airline security

January 13th, 2010. Posted in IT. No Comments.

Despite the fact that I was traveling when I wrote this, this post is not about air travel, but it is about security. One topic that continually comes up when the subject of cloud computing is discussed is security. A recent article in MIT Technology Review, Security in the Ether, discusses the issues. CNN tries to scare you with a title like A trip into the secret, online ‘cloud’. Spooky stuff. It’s not a cloud, it’s a ‘cloud’. And it’s secret. (Secret? Really? There are a lot of words that come to mind when I think of compute clouds, but secret is not one of them. Just about every talk at OSCON last year mentioned the cloud.)

Now the FTC wants the FCC to warn consumers that storing personal data “in the cloud” makes it easier for “hackers” to access it (and by hackers I mean federal law enforcement officials). While I agree that consumers should be careful about the type of information they share and store online (an admonition that is likely lost on the Facebook generation) and think about the larger issues around the cloud like ownership and control, personal information is not really a more significant issue in bioinformatics cloud computing than in bioinformatics local computing (other than the issue of the credit card number you use to pay for the service). Sure, if you are sequencing human genomes you need to transfer the data to and from the cloud securely, but for most projects we have to submit the data to central repositories anyway. So transferring data in a secure way, whether it be to clouds or NCBI, is a largely solved problem (data transfer rates notwithstanding).

“How can we secure our data in the cloud?” is the common question that arises in cloud computing. While the consideration of security in the context of cloud computing is laudable, it is likely (and unfortunate) that the same people raising the specter of security in the cloud don’t think as much about security on their own systems. In a recent post I mentioned how dead simple it was to perform security updates on an Ubuntu system. Unfortunately, despite it being simple, it often doesn’t get done. However, what is more insidious is a different kind of cloud security: wireless networks. A wireless network provides anyone with a Pringles can “physical” access to your network, yet often only minimal security, if any, is used on these networks. Add to that the often lax physical security around company and university networks and I have to say I don’t really see data security as a major concern for me when it comes to cloud computing. That is not to say it is not a concern, rather that it does not concern me in the cloud much more than it does on my own network.


HiSeq 2000

January 12th, 2010. Posted in IT, genomics. 6 Comments.

Today Illumina announced their new, high-throughput sequencing instrument, the HiSeq 2000. Sure, the name isn’t that great, but the capabilities, if not game changing, are a significant step forward: the HiSeq 2000 can sequence a tumor/normal pair to 30× coverage in one 8-day run. How does it achieve this five-fold improvement in throughput over current second-generation sequencing technologies? What it doesn’t do is change the fundamentals of the Illumina sequencing technology. The HiSeq 2000 uses Sequencing By Synthesis (SBS), just like the Genome Analyzer (GA). In fact, it actually dials down the current SBS state of the art, using lower cluster densities (350,000 – 400,000 clusters/mm²) and read lengths (2×100) than the latest GA IIx release (600,000 clusters/mm² and 2×125). (Current tiles are 0.5293 mm², so 600,000 clusters/mm² equate to about 318,000 clusters/tile.)

The throughput improvement comes from two major factors: increased data collection area and rate. The HiSeq 2000 has two 8-lane flow cells, as compared to the single flow cell on the GA, and images both the top and bottom surfaces of the flow cell. In addition, the imaging area of the HiSeq 2000 flow cell is larger than the GA flow cell’s. This all adds up to a more than five-fold increase in surface area to collect data from on the HiSeq 2000.

As you know if you operate a GA, the imaging part of each cycle takes up more time than the chemistry portion. Thus, to run two flow cells on the same instrument, Illumina needed to speed up data acquisition so that it was at least as fast as the chemistry stage so that one flow cell could be doing chemistry while the other was imaging (like the SOLiD platform from Life Technologies). To do this, they used their experience with systems like iScan and its Time Delay and Integration (TDI) line imaging technology, and completely replaced the entire optics system.

The GA performs area imaging to collect its image data. The flow cell is moved, the camera focuses, and four images (tiles) are taken (one for each base). The flow cell is then moved again and the process repeated. For the current GA IIx, each of the eight lanes is imaged at 120 positions (in a 2×60 grid) resulting in 480 images per lane per cycle. The HiSeq 2000 scans a 2048 pixel wide swath down one side of a lane and then comes back and scans the swath on the other side of the lane. This is then repeated for the other surface in the lane and then across all the lanes. Because of this continuous data collection, there are four cameras in the system rather than one. This line scanning system is able to collect data at a rate of 50 MB/s, as compared to about 8 MB/s in the GA IIx.

When you put all of this together, the HiSeq 2000 is able to generate about 200 Gb of sequence from over 1 billion clusters in the form of 2×100 base reads from two flow cells in about eight days with error rates (1-2%) comparable to current GA IIx data (as one would expect since both use SBS). Illumina actually already has data from “production” instruments on several human genomes.
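
For those who like to check the arithmetic, the figures above fit together as follows. This is only a sanity check on the numbers quoted in this post; the 1 billion cluster count is the stated lower bound ("over 1 billion"), so the yield is approximate.

```python
TILE_AREA_MM2 = 0.5293   # area of a current GA tile
GA_DENSITY = 600_000     # clusters per mm^2 on the latest GA IIx release

clusters_per_tile = round(TILE_AREA_MM2 * GA_DENSITY)

POSITIONS_PER_LANE = 120  # 2 x 60 grid
IMAGES_PER_POSITION = 4   # one image per base
images_per_lane_per_cycle = POSITIONS_PER_LANE * IMAGES_PER_POSITION

HISEQ_CLUSTERS = 1_000_000_000  # lower bound from the text
BASES_PER_CLUSTER = 2 * 100     # paired-end 2x100 reads
RUN_DAYS = 8

yield_gb = HISEQ_CLUSTERS * BASES_PER_CLUSTER / 1e9
gb_per_day = yield_gb / RUN_DAYS

print(f"GA IIx clusters per tile:  {clusters_per_tile:,}")
print(f"GA IIx images/lane/cycle:  {images_per_lane_per_cycle}")
print(f"HiSeq 2000 yield per run:  {yield_gb:.0f} Gb")
print(f"HiSeq 2000 yield per day:  {gb_per_day:.0f} Gb")
```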

Because of the five-fold increase in sequence data generation rate (25 Gb/day versus 5 Gb/day for the GA IIx), Illumina needed to rethink how it processed and stored all the data. Normal hard drives cannot write four 625 MB images every 30 seconds. As such, images are not written to disk by default; they are processed in memory by the instrument control software (as opposed to the GA, where images are written to the disk and processed by RTA, which also does the base calling). You can save images if you want, but you will need 32 TB of disk space per run and it will slow down your run. As with the most recent version of RTA for the GA IIx, you can save thumbnail images (without penalty) to aid in troubleshooting (the thumbnails, of course, cannot be used for off-instrument analysis). Because of the need to incorporate phasing and pre-phasing information when base calling, the RTA for HiSeq lags a few cycles behind the current data acquisition cycle. The result is that base calling does not actually complete until about two hours after the run completes. In other words, the processing of data is not real time, but it is synchronous. In fact, if the data analysis falls behind, the instrument is paused in a safe state until it catches up. This is guaranteed to occur at least once in each run: after around five cycles the instrument will pause for about two hours while template generation (cluster identification) is performed.

The large data rates also forced Illumina to rethink how they store and transfer data off the instrument. Gone are the QSEQ files; they are replaced by BCL files, which are binary, per image, per cycle files that contain the base call and quality information. Because they are per image, per cycle files, they can be transferred cycle by cycle as they are generated (as opposed to QSEQ files, which are read based). The BCL files are also more compact, requiring only 1 byte/base (B/b) as compared to QSEQ files, which require about 2.5 B/b. In addition, the intensity files are also not transferred by default, so RTA output goes from 10 B/b to just 1 B/b. Thus, even though you are generating five times more sequence data than a GA, your RTA directory will actually be smaller (about 250 GB).
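
A quick sketch of the storage arithmetic, using only the figures quoted above. The GA comparison assumes a GA run yields one fifth of the HiSeq's 200 Gb, which is my reading of "five times more sequence data" and not a number from Illumina.

```python
def write_rate_mb_per_s(images, mb_per_image, seconds):
    """Sustained write rate needed to save every image to disk."""
    return images * mb_per_image / seconds


def output_gb(yield_gb, bytes_per_base):
    """Disk space in GB for a run: gigabases of sequence times bytes stored per base."""
    return yield_gb * bytes_per_base


HISEQ_YIELD_GB = 200
GA_YIELD_GB = HISEQ_YIELD_GB / 5  # assumed: one fifth of a HiSeq run

print(f"image write rate needed: {write_rate_mb_per_s(4, 625, 30):.1f} MB/s")
print(f"HiSeq BCL files (1 B/b):     {output_gb(HISEQ_YIELD_GB, 1):.0f} GB")
print(f"HiSeq as QSEQ (2.5 B/b):     {output_gb(HISEQ_YIELD_GB, 2.5):.0f} GB")
print(f"GA RTA output (10 B/b):      {output_gb(GA_YIELD_GB, 10):.0f} GB")
```

The roughly 250 GB RTA directory mentioned above is a bit larger than the 200 GB of BCL files alone, presumably because of the reports and other files that ride along, but it is still well under what a GA run produces at 10 B/b.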

The HiSeq 2000 has a completely new instrument software user interface. The instrument user interface allows the operator to input data via a keyboard and mouse or a touch screen. Run configuration and setup are done via a wizard-driven workflow. The setup and running of each flow cell is completely independent. This allows you to start the runs at different times, have different numbers of cycles for each flow cell, and even do an indexing run on one flow cell and a standard paired-end run on the other. The cycles of each flow cell will need to synchronize so that one is doing chemistry and the other data acquisition. Unfortunately, the current version of the instrument control software has no LIMS integration capabilities. Since this instrument is clearly targeting large genome centers, that is unfortunate.

The instrument software also has greatly enhanced real-time metric reporting as compared to the GA. In addition to the RTA reports, e.g., cluster density, intensity, focus, and quality scores, the standard reports typically generated after a GA run by GERALD, e.g., the Summary report, are generated cycle by cycle by RTA and made available to the operator via the instrument control software and remotely as HTML pages (there is also discussion of a smart phone application). Phi X can be spiked into lanes to allow the software to generate error rate numbers (and Error and Perfect plots) on the fly as well. All in all, the reports are very similar to those people have become familiar with using the GA; they are just generated dynamically during the run. This will allow operators to more carefully observe their runs and take corrective action if something goes awry. All of the extra data processing and reports do not come without the requirement of additional computational horsepower. Don’t worry though, no iPAR is necessary. The HiSeq instrument computer is just beefier than its GA counterpart: two quad-core 64-bit processors, 48 GiB of RAM, and a 64-bit Microsoft Windows Vista operating system. For downstream analysis, Illumina will still offer their IlluminaCompute (turn-key sequence data analysis cluster) but also is strongly pushing cloud-based analysis solutions (specifically Amazon AWS). Illumina has altered GERALD so ELANDv2 can run using more than one process per lane. Alignment of 200 Gb of data using ELANDv2 takes about 30 hours using 64 cores.

The good and the bad of this instrument is that it is really just more of the same. Illumina has taken the optics from iScan and combined that with the fluidics and chemistry of the GA. This means the system is more likely to “work” at launch than those of us dealing with new sequencing platforms are used to. It also means the data will be familiar (just more of it) and therefore will suffer from the same limitations (increasing errors with read length, short insert sizes). Shrinking from the bleeding edge of the GA in terms of cluster density and read length means the HiSeq likely has significant head room to increase well beyond 200 Gb/run. A quick back of the envelope calculation pushing the HiSeq to 600,000 clusters/mm² and 2×150 read lengths results in 450 Gb/run. (Again, that is my rough calculation and not any sort of promise from Illumina.) So, while it may be more of the same, it is likely that it will be a lot more of the same. The ability to sequence a tumor and normal genome from an individual in a single instrument run in about a week is really going to change the calculation (and economics) for cancer sequencing going forward.
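
The back of the envelope calculation, written out. It assumes yield scales linearly with both cluster density and read length, and it takes 400,000 clusters/mm² (the top of the quoted 350,000 – 400,000 range) as the current density; both are simplifying assumptions.

```python
def projected_yield_gb(base_gb, base_density, base_read_len, new_density, new_read_len):
    """Scale a run's yield linearly with cluster density and read length."""
    return base_gb * (new_density / base_density) * (new_read_len / base_read_len)


print(projected_yield_gb(200, 400_000, 100, 600_000, 150))  # 450.0
```

Starting from 350,000 clusters/mm² instead would push the projection above 500 Gb/run, so treat 450 as the conservative end.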

Update: The above text has been corrected to state that QSEQ files are about 2.5 B/b. It is the entire RTA output that is 10 B/b.

Update2: I’ve added some links.


Head in the clouds

January 10th, 2010. Posted in IT, genomics. 2 Comments.

It seems that due to my recent post, Bioinformatics and cloud computing, I have been labeled a cloud skeptic. While I don’t reject that label outright, I won’t accept it either. If I may label myself, I would call myself a cloud realist. My first piece of evidence is that at the end of my previous post I specifically state, “This is all not to say that there is not a place for cloud and other distributed computing frameworks in bioinformatics, but that’s the topic of a future post.” Unfortunately, this is not the future post to which that statement refers. The purpose of this post is to respond to some of the comments made on that post and around the web.

First, Ben Langmead said,

My main comment is that you’re comparing the cloud cost against only at one type of cost: the one-time cost of buying new machines and adding them to your (already large; at least at Wash U) pool of computers. That isn’t the only relevant number for a lot of people, especially those in smaller institutions and academic departments, because (a) there are recurring costs for electricity, cooling, space, and (b) there isn’t necessarily a huge pool of computers (and support staff, and space) to begin with, so the initial cost and effort barrier can be much larger than the cost of the machines per se.

Bob Carpenter then adds similar comments,

To repeat what Ben Langmead said above, the total cost of ownership of a computer, even for a university, is much higher than its purchase cost. For instance, how many computers does each sysadmin manage (or how much time does it take to manage new operating system patches, software installs, etc.)? How much space do they take up? The power for these beasts is not inconsiderable… My wife’s having trouble with her cluster at NYU because the building’s heating and cooling are both tied to the same faulty plumbing system; so even though it’s winter here in NYC, when the heat went out, so did the machine room cooling, so they had to shut down all the machines for a day or two. Just like when the AC went out in the summer.

Finally, Shiran Pasternak over at Plant Tech Tonics says

What his numbers don’t take into account is the overhead of running a (possibly single node) cluster. While the fixed cost of purchasing computer equipment might be manageable, especially compared to chemical reagents, the operational costs of running a data center are substantial. Computer equipment needs to be continually serviced, be it for software, security, or kernel patches, or for unscheduled maintenance. In addition, energy costs for running a data center are high and expected to increase in the near future.

Yes, it is true that the cost for the Dell server I quoted was just the purchase price. But the price I quoted for a computing core in our cluster, $500, was a fully loaded cost. As indicated in the post, that fully loaded cost includes server, rack, networking, electrical hookup, installation, 3-year warranty, etc. In other words, that is the cost to add a core to an existing cluster and was provided for those researchers that do have clusters (as opposed to the cost of the Dell which was provided for those who do not). It does not include system administration, electrical power, or cooling. In other words, it does not include ongoing costs, only capital costs. Why did I not include those ongoing costs? Because I did not need to. To maintain pace with the sequence data generated by an Illumina GA IIx or two, you don’t need any of that stuff! For electrical power and cooling, the addition of a few cores to an existing computing infrastructure is not going to make a substantive difference in power or cooling. For a lab without an existing computing cluster, all you need is the desk where you sit your bioinformatician. If you are at a normally operating university, the electrical power and cooling to office space is provided from the overhead your university takes out of your grants. If you operate a core facility at a university, then you simply work these costs into the fees you charge (their contributions are several orders of magnitude less than the sequencing reagents). What about labs who have lots of sequencers but not a lot of computing power? Well, that’s bad planning and allocation of assets; no one can help you.

Systems administration costs are a similar story. For researchers with existing clusters, the addition of a few cores to keep pace with a few Illumina instruments will not require them to hire additional IT staff. For researchers without a cluster, I posit that it does not take more system administration costs to manage a single desktop workstation than it would to manage a cluster of Amazon EC2 nodes. Amazon EC2 provides virtual hardware and a stock installation of an operating system. Aside from the fact that you can purchase computers from Dell with Red Hat Enterprise [GNU/]Linux, any bioinformatician worth her salt (or any 12-year-old for that matter) can install Ubuntu on a computer. Just as the Dell customer will have to install their bioinformatics tools on the systems, so too will the Amazon EC2 customer; except they will need to install them on all the nodes they have rented. Regarding maintaining security patches and other updates, that is also dead simple in Ubuntu (although I will readily admit that just because something is easy, it does not necessarily follow that people will do it). The bottom line is that maintaining a workstation used for day-to-day activities and analyzing data from one or two Illumina instruments is more likely to be within the capabilities of a bioinformatician than setting up and maintaining an Amazon EC2 cluster.

Another point brought up in the above comments was reliability of the systems. One of the arguments in this area is that with your own hardware, you are responsible for maintaining the equipment while with Amazon EC2, they manage all the hardware. This is not really the case, though. All of the costs I have quoted included a 3-year warranty with on-site service. The reliability argument also involves downtime. If your local systems go down, whether for hardware failures, network outages, power outages, or Armageddon, it is true that you will not be able to do any computations on them, but you’re also not going to be able to access your EC2 systems and those EC2 systems will not be able to pull data from your systems (and in the case of Armageddon, Amazon EC2 will probably also be down).

So, that leaves us with the question, what would the fully loaded cost of the Dell workstation be, and what is the break-even point with Amazon EC2? The cost of the quad-core system was roughly $1700. You only need one core for data analysis. Since you need to buy your bioinformatician a workstation anyway and it needs an operating system, bioinformatics software, power, and cooling, we’ll ignore those costs. So the purchase price becomes the fully loaded cost for comparison purposes. Assuming you would buy your bioinformatician a dual-core system with 1 GiB of RAM (Firefox uses a lot of memory) which costs about $1000, the incremental cost of getting a machine capable of analyzing data is $700; the incremental cost per computing core is only $350. That dollar amount will buy you less than three genomes’ worth of analysis on Amazon EC2.
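
Here is the same break-even reasoning as a small sketch so you can substitute your own prices. The two workstation prices come from the paragraph above; the $125 per-genome cloud cost is a placeholder I made up for illustration, so plug in whatever your own analysis actually costs on the provider you are considering.

```python
def incremental_cost_per_core(bigger_price, baseline_price, extra_cores):
    """Extra dollars per additional core when buying the bigger workstation."""
    return (bigger_price - baseline_price) / extra_cores


def break_even_genomes(local_cost, cloud_cost_per_genome):
    """Number of genomes at which cloud spending equals the local hardware cost."""
    return local_cost / cloud_cost_per_genome


QUAD_CORE_PRICE = 1700         # from the post
DUAL_CORE_PRICE = 1000         # from the post
CLOUD_COST_PER_GENOME = 125.0  # hypothetical placeholder, not a quoted price

per_core = incremental_cost_per_core(QUAD_CORE_PRICE, DUAL_CORE_PRICE, extra_cores=2)
print(f"incremental cost per core: ${per_core:.0f}")
print(f"break-even after {break_even_genomes(per_core, CLOUD_COST_PER_GENOME):.1f} genomes")
```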

Bob Carpenter had a few other points worth addressing: viruses and running analysis multiple times. I would argue that the former is an issue regardless of where you run your analysis. Plus, for the GNU/Linux systems we are talking about in these scenarios, viruses are much less of an issue than they are for Microsoft Windows. Regarding running analysis multiple times, sure it would mean you may need more than one core to keep up, but it also means you are going to pay Amazon a lot more too. With the quad-core system quoted above, you have a whole extra core (two for the desktop, one for the single pass analysis, and one extra) to spill over into at no cost.

Before I close, I would like to thank all the commenters for raising the above points. All of the issues they raised are very important to consider when jumping into the next-generation informatics space. They also made it clear that my previous post was not as thorough as I thought it was when I hit the publish button. In addition to the excellent comments I quoted above, there were also several other good points regarding software in the comments of the previous post that I hope to incorporate in future posts (and hopefully this post will generate a few comments as well).
