PolITiGenomics

Politics, Information Technology, and Genomics

Cloudy with a chance of sunshine

January 25th, 2010 dd Posted in IT No Comments »

As stated in previous posts (Bioinformatics and cloud computing and Head in the clouds), I don’t think that cloud computing wins the cost competition with local resources. However, there are several reasons why an organization should consider cloud computing. Several of the reasons I present below are discussed in a great interview with Russ Daniels of HP at ars technica, Into the cloud: a conversation with Russ Daniels, Part I and Part II. If you are at all curious about cloud computing, it is well worth reading. (You may also be interested in the ScienceCloud 2010 Workshop.)

Peaks and valleys

The ability to dynamically provision computing resources is integral to the concept of clouds. Dynamic provisioning is often used by online retailers to account for variability in consumer buying. The retailer may have 20 servers that it maintains year round to service average purchasing but also dynamically add servers in the cloud to account for peaks in purchasing, e.g., around the Christmas holiday. In bioinformatics, there are often computational crunches before papers get submitted, before meetings, or when a mistake in an algorithm is found and a large number of calculations need to be redone (Miron Livny of Condor and Open Science Grid calls these “oopses”). Another type of dynamic provisioning involves varying levels of certain hardware architectures or operating systems as needed by current computational demand. For example, one application may require x86 and Ubuntu 8.04 LTS while another may require amd64/em64t/x86_64 and Ubuntu 9.10. If the utilization of each of these programs is cyclical, you can provision the exact system you want when it is needed. This can be done using something like Amazon EC2 or an internal cloud. Thus, dynamic provisioning allows IT departments to design their solutions for steady state operations but still meet computational needs during peaks.

Space, the final frontier

At universities all over the world there is a constant battle for space. Researchers are always seeking more and administrators are always miserly about allocating it. If your computing needs expand beyond your ability to house, power, and cool them, cloud computing offers a solution. While it may not be cheaper than if the space, power, and cooling were available and paid for out of your grant overhead, it will almost certainly be cheaper than buying your own land and building your own data center. Of course, what people traditionally think of as cloud computing, e.g., Amazon EC2, is not the only option here. There are colocation facilities and scientific computing resources, e.g., NCSA and Open Science Grid. The latter are normally acquired through a granting process.

Persistence pays off

Cloud computing is also very attractive because of its persistence. If I have my computing and storage in the cloud, I can access it from anywhere. When the power goes out at my office, I can use my phone to access the data. When my computer crashes, the computation is still running on the cloud. When my disk fails, my data is still in the cloud. Of course, the cloud does fail at times too. Amazon promises 99.9% uptime, or nearly 9 hours of downtime per year. Of course, if the cloud resources are pulling data from your site (something that may take more time than the computation with current solutions), when your systems go down, you’re still out of luck.
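
The downtime figure above follows directly from the uptime percentage. A minimal Python sketch of that arithmetic, assuming a 365-day year (the function name is only illustrative):

```python
HOURS_PER_YEAR = 365 * 24  # 8760 hours, ignoring leap years


def allowed_downtime_hours(uptime_fraction: float) -> float:
    """Hours per year a service can be down and still meet its uptime promise."""
    return (1.0 - uptime_fraction) * HOURS_PER_YEAR


downtime_999 = allowed_downtime_hours(0.999)
print(f"99.9% uptime allows about {downtime_999:.2f} hours of downtime per year")
```

The same function shows how quickly the allowance shrinks: each additional "nine" of uptime cuts the permitted downtime by a factor of ten.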

Airline security

January 13th, 2010 dd Posted in IT No Comments »

Despite the fact that I was traveling when I wrote this, this post is not about air travel, but it is about security. One topic that continually comes up when the subject of cloud computing is discussed is security. A recent article in MIT Technology Review, Security in the Ether, discusses the issues. CNN tries to scare you with a title like A trip into the secret, online ‘cloud’. Spooky stuff. It’s not a cloud, it’s a ‘cloud’. And it’s secret. (Secret? Really? There are a lot of words that come to mind when I think of compute clouds, but secret is not one of them. Just about every talk at OSCON last year mentioned the cloud.) Now the FTC wants the FCC to warn consumers that storing personal data “in the cloud” makes it easier for “hackers” to access it (and by hackers I mean federal law enforcement officials).

While I agree that consumers should be careful about the type of information they share and store online (an admonition that is likely lost on the Facebook generation) and think about the larger issues around the cloud like ownership and control, personal information is not really a more significant issue in bioinformatics cloud computing than in bioinformatics local computing (other than the issue of the credit card number you use to pay for the service). Sure, if you are sequencing human genomes you need to transfer the data to and from the cloud securely, but for most projects we have to submit the data to central repositories anyway. So transferring data in a secure way, whether it be to clouds or NCBI, is a largely solved problem (data transfer rates notwithstanding).

“How can we secure our data in the cloud?” is the common question that arises in cloud computing. While the consideration of security in the context of cloud computing is laudable, it is likely (and unfortunate) that the same people raising the specter of security in the cloud don’t think as much about security on their own systems. In a recent post I mentioned how dead simple it was to perform security updates on an Ubuntu system. Unfortunately, despite it being simple, it often doesn’t get done. However, what is more insidious is a different kind of cloud security: wireless networks. A wireless network provides anyone with a Pringles can “physical” access to your network, yet often only minimal if any security is used on these networks. Add to that often lax physical security around company and university networks and I have to say I don’t really see data security as a major concern for me when it comes to cloud computing. That is not to say it is not a concern, rather that it does not concern me in the cloud much more than it does on my own network.

HiSeq 2000

January 12th, 2010 dd Posted in genomics, IT 9 Comments »

Today Illumina announced their new, high-throughput sequencing instrument, the HiSeq 2000. Sure, the name isn’t that great, but the capabilities, if not game changing, are a significant step forward: the HiSeq 2000 can sequence a tumor/normal pair to 30× coverage in one 8-day run.

How does it achieve this five-fold improvement in throughput over current second-generation sequencing technologies? What it doesn’t do is change the fundamentals of the Illumina sequencing technology. The HiSeq 2000 uses Sequencing By Synthesis (SBS), just like the Genome Analyzer (GA). In fact, it actually dials down the current SBS state of the art, using lower cluster densities (350,000 – 400,000 clusters/mm²) and read lengths (2×100) than the latest GA IIx release (600,000 clusters/mm² and 2×125). (Current tiles are 0.5293 mm², so 600,000 clusters/mm² equate to about 318,000 clusters/tile.)

The throughput improvement comes from two major factors: increased data collection area and rate. The HiSeq 2000 has two 8-lane flow cells, as compared to the single flow cell on the GA, and images both the top and bottom surfaces of the flow cell. In addition, the imaging area of the HiSeq 2000 flow cell is larger than the GA flow cell’s. This all adds up to a more than five-fold increase in surface area to collect data from on the HiSeq 2000.

As you know if you operate a GA, the imaging part of each cycle takes up more time than the chemistry portion. Thus, to run two flow cells on the same instrument, Illumina needed to speed up data acquisition so that it was at least as fast as the chemistry stage so that one flow cell could be doing chemistry while the other was imaging (like the SOLiD platform from Life Technologies). To do this, they used their experience with systems like iScan and its Time Delay and Integration (TDI) line imaging technology, and completely replaced the entire optics system.

The GA performs area imaging to collect its image data. The flow cell is moved, the camera focuses, and four images (tiles) are taken (one for each base). The flow cell is then moved again and the process repeated. For the current GA IIx, each of the eight lanes is imaged at 120 positions (in a 2×60 grid) resulting in 480 images per lane per cycle. The HiSeq 2000 scans a 2048 pixel wide swath down one side of a lane and then comes back and scans the swath on the other side of the lane. This is then repeated for the other surface in the lane and then across all the lanes. Because of this continuous data collection, there are four cameras in the system rather than one. This line scanning system is able to collect data at a rate of 50 MB/s, as compared to about 8 MB/s in the GA IIx.

When you put all of this together, the HiSeq 2000 is able to generate about 200 Gb of sequence from over 1 billion clusters in the form of 2×100 base reads from two flow cells in about eight days with error rates (1-2%) comparable to current GA IIx data (as one would expect since both use SBS). Illumina actually already has data from “production” instruments on several human genomes.
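
Several of the figures quoted above can be cross-checked against one another. The following Python sketch is only a sanity check using the numbers from this post; the variable names are mine, and the cluster count it derives is the minimum needed for 200 Gb, not a figure from Illumina.

```python
# Figures quoted in the post
TILE_AREA_MM2 = 0.5293            # area of a current GA tile
GA_DENSITY_PER_MM2 = 600_000      # latest GA IIx cluster density
GA_POSITIONS_PER_LANE = 2 * 60    # 2x60 grid of imaging positions
IMAGES_PER_POSITION = 4           # one image per base
HISEQ_YIELD_GB = 200              # gigabases per run
HISEQ_RUN_DAYS = 8
BASES_PER_CLUSTER = 2 * 100       # paired 100-base reads

# Clusters on one GA tile at the quoted density (about 318,000)
clusters_per_tile = TILE_AREA_MM2 * GA_DENSITY_PER_MM2

# Images the GA IIx takes per lane per cycle (480)
images_per_lane_per_cycle = GA_POSITIONS_PER_LANE * IMAGES_PER_POSITION

# Clusters needed to reach 200 Gb with 2x100 reads (1 billion)
clusters_needed = HISEQ_YIELD_GB * 1e9 / BASES_PER_CLUSTER

# Average daily yield over the run (25 Gb/day)
gb_per_day = HISEQ_YIELD_GB / HISEQ_RUN_DAYS

print(f"{clusters_per_tile:,.0f} clusters/tile")
print(f"{images_per_lane_per_cycle} images/lane/cycle")
print(f"{clusters_needed:,.0f} clusters for {HISEQ_YIELD_GB} Gb")
print(f"{gb_per_day:.0f} Gb/day")
```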

Because of the five-fold increase in sequence data generation rate (25 Gb/day versus 5 Gb/day for the GA IIx), Illumina needed to rethink how it processed and stored all the data. Normal hard drives cannot write four 625 MB images every 30 seconds. As such, images are not written to disk by default; they are processed in memory by the instrument control software (as opposed to the GA where images are written to the disk and processed by RTA which also does the base calling). You can save images if you want, but you will need 32 TB of disk space per run and it will slow down your run. Like the most recent version of RTA for the GA IIx, you can save thumbnail images (without penalty) to aid in troubleshooting (the thumbnails, of course, cannot be used for off-instrument analysis).

Because of the need to incorporate phasing and pre-phasing information when base calling, the RTA for HiSeq lags a few cycles behind the current data acquisition cycle. The result is that base calling does not actually complete until about two hours after the run completes. In other words, the processing of data is not real time, but it is synchronous. In fact, if the data analysis falls behind, the instrument is paused in a safe state until it catches up. This is guaranteed to occur at least once in each run: after around five cycles the instrument will pause for about two hours while template generation (cluster identification) is performed.

The large data rates also forced Illumina to rethink how they store and transfer data off the instrument. Gone are the QSEQ files; they are replaced by BCL files, which are binary, per image, per cycle files that contain the base call and quality information. Because they are per image, per cycle files, they can be transferred cycle by cycle as they are generated (as opposed to QSEQ files which are read based). The BCL files are also more compact, requiring only 1 byte/base (B/b) as compared to QSEQ files which require about 2.5 B/b. In addition, the intensity files are also not transferred by default, so RTA output goes from 10 B/b to just 1 B/b. Thus, even though you are generating five times more sequence data than a GA, your RTA directory will actually be smaller (about 250 GB).
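
To put the bytes-per-base figures in perspective, here is a small Python sketch that applies them to a 200 Gb run and works out the sustained write rate implied by four 625 MB images every 30 seconds. It uses only numbers from this post and assumes decimal units (1 GB = 10^9 bytes); the dictionary labels are mine.

```python
BASES_PER_RUN = 200e9  # about 200 Gb per HiSeq 2000 run

# Bytes per base (B/b) for each output style quoted above
BYTES_PER_BASE = {
    "BCL": 1.0,
    "QSEQ": 2.5,
    "full RTA output with intensities": 10.0,
}


def size_gb(bases: float, bytes_per_base: float) -> float:
    """On-disk size in GB (decimal) for a given number of bases."""
    return bases * bytes_per_base / 1e9


sizes = {name: size_gb(BASES_PER_RUN, bpb) for name, bpb in BYTES_PER_BASE.items()}
for name, gb in sizes.items():
    print(f"{name}: {gb:,.0f} GB")

# Sustained rate needed to save four 625 MB images every 30 seconds
image_write_rate_mb_s = 4 * 625 / 30
print(f"image write rate: {image_write_rate_mb_s:.1f} MB/s")
```

The 200 GB of base calls is consistent with the roughly 250 GB RTA directory mentioned above once the reports and other small files are included.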

The HiSeq 2000 has a completely new instrument software user interface. The instrument user interface allows the operator to input data via a keyboard and mouse or a touch screen. Run configuration and setup are done via a wizard-driven workflow. The setup and running of each flow cell is completely independent. This allows you to start the runs at different times, have a different number of cycles for each flow cell, and even do an indexing run on one flow cell and a standard paired-end run on the other. The cycles of each flow cell will need to synchronize so that one is doing chemistry and the other data acquisition. Unfortunately, the current version of the instrument control software has no LIMS integration capabilities. Since this instrument is clearly targeting large genome centers, that is unfortunate.

The instrument software also has greatly enhanced real-time metric reporting as compared to the GA. In addition to the RTA reports, e.g., cluster density, intensity, focus, and quality scores, the standard reports typically generated after a GA run by GERALD, e.g., the Summary report, are generated cycle by cycle by RTA and made available to the operator via the instrument control software and remotely as HTML pages (there is also discussion of a smart phone application). Phi X can be spiked into lanes to allow the software to generate error rate numbers (and Error and Perfect plots) on the fly as well. All in all, the reports are very similar to those people have become familiar with using the GA; they are just generated dynamically during the run. This will allow operators to more carefully observe their runs and take corrective action if something goes awry. All of the extra data processing and reports do not come without the requirement of additional computational horsepower. Don’t worry though, no iPAR is necessary. The HiSeq instrument computer is just beefier than its GA counterpart: two quad-core 64-bit processors, 48 GiB of RAM, and a 64-bit Microsoft Windows Vista operating system. For downstream analysis, Illumina will still offer their IlluminaCompute (turn-key sequence data analysis cluster) but also is strongly pushing cloud-based analysis solutions (specifically Amazon AWS). Illumina has altered GERALD so ELANDv2 can run using more than one process per lane. Alignment of 200 Gb of data using ELANDv2 takes about 30 hours using 64 cores.

The good and the bad of this instrument is that it is really just more of the same. Illumina has taken the optics from iScan and combined that with the fluidics and chemistry of the GA. This means the system is more likely to “work” at launch than those of us dealing with new sequencing platforms are used to. It also means the data will be familiar (just more of it) and therefore will suffer from the same limitations (increasing errors with read length, short insert sizes). Shrinking from the bleeding edge of the GA in terms of cluster density and read length means the HiSeq likely has significant head room to increase well beyond 200 Gb/run. A quick back-of-the-envelope calculation pushing the HiSeq to 600,000 clusters/mm² and 2×150 read lengths results in 450 Gb/run. (Again, that is my rough calculation and not any sort of promise from Illumina.) So, while it may be more of the same, it is likely that it will be a lot more of the same. The ability to sequence a tumor and normal genome from an individual in a single instrument run in about a week is really going to change the calculation (and economics) for cancer sequencing going forward.
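
The 450 Gb/run estimate can be reproduced with a simple scaling. The sketch below assumes yield grows linearly with both cluster density and read length, and that the 200 Gb baseline corresponds to 400,000 clusters/mm² (the top of the quoted range); both are assumptions for illustration rather than anything stated by Illumina.

```python
def scaled_yield_gb(base_yield_gb: float,
                    base_density: float, base_read_len: int,
                    new_density: float, new_read_len: int) -> float:
    """Scale run yield linearly with cluster density and read length."""
    return (base_yield_gb
            * (new_density / base_density)
            * (new_read_len / base_read_len))


# Baseline: 200 Gb at 400,000 clusters/mm^2 and 2x100 reads (assumed pairing)
projected_gb = scaled_yield_gb(200, 400_000, 100, 600_000, 150)
print(f"projected yield: {projected_gb:.0f} Gb/run")

# If the baseline density were instead 350,000 clusters/mm^2
projected_gb_low_baseline = scaled_yield_gb(200, 350_000, 100, 600_000, 150)
print(f"with a 350,000 baseline: {projected_gb_low_baseline:.0f} Gb/run")
```

Starting from the lower end of the density range gives an even larger projection, so 450 Gb/run is the more conservative of the two.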

Update: The above text has been corrected to state that QSEQ files are about 2.5 B/b. It is the entire RTA output that is 10 B/b.

Update2: I’ve added some links.

Head in the clouds

January 10th, 2010 dd Posted in genomics, IT 2 Comments »

It seems that due to my recent post, Bioinformatics and cloud computing, I have been labeled a cloud skeptic. While I don’t reject that label outright, I won’t accept it either. If I may label myself, I would call myself a cloud realist. My first piece of evidence is that at the end of my previous post I specifically state, “This is all not to say that there is not a place for cloud and other distributed computing frameworks in bioinformatics, but that’s the topic of a future post.” Unfortunately, this is not the future post to which that statement refers. The purpose of this post is to respond to some of the comments made on that post and around the web.

First, Ben Langmead said,

My main comment is that you’re comparing the cloud cost against only at one type of cost: the one-time cost of buying new machines and adding them to your (already large; at least at Wash U) pool of computers. That isn’t the only relevant number for a lot of people, especially those in smaller institutions and academic departments, because (a) there are recurring costs for electricity, cooling, space, and (b) there isn’t necessarily a huge pool of computers (and support staff, and space) to begin with, so the initial cost and effort barrier can be much larger than the cost of the machines per se.

Bob Carpenter then adds similar comments,

To repeat what Ben Langmead said above, the total cost of ownership of a computer, even for a university, is much higher than its purchase cost. For instance, how many computers does each sysadmin manage (or how much time does it take to manage new operating system patches, software installs, etc.)? How much space do they take up? The power for these beasts is not inconsiderable… My wife’s having trouble with her cluster at NYU because the building’s heating and cooling are both tied to the same faulty plumbing system; so even though it’s winter here in NYC, when the heat went out, so did the machine room cooling, so they had to shut down all the machines for a day or two. Just like when the AC went out in the summer.

Finally, Shiran Pasternak over at Plant Tech Tonics says

What his numbers don’t take into account is the overhead of running a (possibly single node) cluster. While the fixed cost of purchasing computer equipment might be manageable, especially compared to chemical reagents, the operational costs of running a data center are substantial. Computer equipment needs to be continually serviced, be it for software, security, or kernel patches, or for unscheduled maintenance. In addition, energy costs for running a data center are high and expected to increase in the near future.

Yes, it is true that the cost for the Dell server I quoted was just the purchase price. But the price I quoted for a computing core in our cluster, $500, was a fully loaded cost. As indicated in the post, that fully loaded cost includes server, rack, networking, electrical hookup, installation, 3-year warranty, etc. In other words, that is the cost to add a core to an existing cluster and was provided for those researchers that do have clusters (as opposed to the cost of the Dell which was provided for those who do not). It does not include system administration, electrical power, or cooling. In other words, it does not include ongoing costs, only capital costs. Why did I not include those ongoing costs? Because I did not need to. To maintain pace with the sequence data generated by an Illumina GA IIx or two, you don’t need any of that stuff! For electrical power and cooling, the addition of a few cores to an existing computing infrastructure is not going to make a substantive difference in power or cooling. For a lab without an existing computing cluster, all you need is the desk where you sit your bioinformatician. If you are at a normally operating university, the electrical power and cooling to office space is provided from the overhead your university takes out of your grants. If you operate a core facility at a university, then you simply work these costs into the fees you charge (their contributions are several orders of magnitude less than the sequencing reagents). What about labs who have lots of sequencers but not a lot of computing power? Well, that’s bad planning and allocation of assets; no one can help you.

Systems administration costs are a similar story. For researchers with existing clusters, the addition of a few cores to keep pace with a few Illumina instruments will not require them to hire additional IT staff. For researchers without a cluster, I posit that it does not take more system administration costs to manage a single desktop workstation than it would to manage a cluster of Amazon EC2 nodes. Amazon EC2 provides virtual hardware and a stock installation of an operating system. Aside from the fact that you can purchase computers from Dell with Red Hat Enterprise [GNU/]Linux, any bioinformatician worth her salt (or any 12-year-old for that matter) can install Ubuntu on a computer. Just as the Dell customer will have to install their bioinformatics tools on the systems, so too will the Amazon EC2 customer; except they will need to install them on all the nodes they have rented. Regarding maintaining security patches and other updates, that is also dead simple in Ubuntu (although I will readily admit that just because something is easy, it does not necessarily follow that people will do it). The bottom line is that maintaining a workstation used for day-to-day activities and analyzing data from one or two Illumina instruments is more likely to be within the capabilities of a bioinformatician than setting up and maintaining an Amazon EC2 cluster.

Another point brought up in the above comments was reliability of the systems. One of the arguments in this area is that with your own hardware, you are responsible for maintaining the equipment while with Amazon EC2, they manage all the hardware. This is not really the case, though. All of the costs I have quoted included a 3-year warranty with on-site service. The reliability argument also involves downtime. If your local systems go down, whether for hardware failures, network outages, power outages, or Armageddon, it is true that you will not be able to do any computations on them, but you’re also not going to be able to access your EC2 systems and those EC2 systems will not be able to pull data from your systems (and in the case of Armageddon, Amazon EC2 will probably also be down).

So, that leaves us with the question, what would the fully loaded cost of the Dell workstation be, and what is the break even point with Amazon EC2? The cost of the quad-core system was roughly $1700. You only need one core for data analysis. Since you need to buy your bioinformatician a workstation anyway and it needs an operating system, bioinformatics software, power, and cooling, we’ll ignore those costs. So the purchase price becomes the fully loaded cost for comparison purposes. Assuming you would buy your bioinformatician a dual-core system with 1 GiB of RAM (Firefox uses a lot of memory) which costs about $1000, the incremental cost of getting a machine capable of analyzing data is $700; the incremental cost per computing core is only $350. That dollar amount will buy you less than three genomes’ worth of analysis on Amazon EC2.
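
The incremental cost comparison is easy to reproduce. This Python sketch uses the prices from this post and the roughly $125 per genome EC2 figure from the Crossbow paper discussed in the earlier post; the variable names are illustrative.

```python
QUAD_CORE_PRICE = 1700      # dollars, workstation capable of the analysis
DUAL_CORE_PRICE = 1000      # dollars, workstation you would buy anyway
EC2_COST_PER_GENOME = 125   # dollars, from the Crossbow analysis

# Extra money spent, and extra cores gained, by buying the larger machine
incremental_cost = QUAD_CORE_PRICE - DUAL_CORE_PRICE
extra_cores = 4 - 2
cost_per_extra_core = incremental_cost / extra_cores

# How many EC2 genome analyses one extra core's cost would pay for
genomes_on_ec2 = cost_per_extra_core / EC2_COST_PER_GENOME

print(f"incremental cost: ${incremental_cost}")
print(f"cost per extra core: ${cost_per_extra_core:.0f}")
print(f"equivalent EC2 analyses: {genomes_on_ec2:.1f} genomes")
```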

Bob Carpenter had a few other points worth addressing: viruses and running analysis multiple times. I would argue that the former is an issue regardless of where you run your analysis. Plus, for the GNU/Linux systems we are talking about in these scenarios, viruses are much less of an issue than they are for Microsoft Windows. Regarding running analysis multiple times, sure it would mean you may need more than one core to keep up, but it also means you are going to pay Amazon a lot more too. With the quad-core system quoted above, you have a whole extra core (two for the desktop, one for the single pass analysis, and one extra) to spill over into at no cost.

Before I close, I would like to thank all the commenters for raising the above points. All of the issues they raised are very important to consider when jumping into the next-generation informatics space. They also made it clear that my previous post was not as thorough as I thought it was when I hit the publish button. In addition to the excellent comments I quoted above, there were also several other good points regarding software in the comments of the previous post that I hope to incorporate in future posts (and hopefully this post will generate a few comments as well).

Bioinformatics and cloud computing

November 24th, 2009 dd Posted in genomics, IT 9 Comments »

From the Using clouds for parallel computations in systems biology workshop at the recent SC09 conference (Informatics Iron writeup) to last month’s Genome Informatics meeting, everyone in bioinformatics is talking about cloud computing these days. Last week Steven Salzberg‘s group published a paper on their Crossbow tool entitled Searching for SNPs with cloud computing (Cloudera blog post on Crossbow). In the paper the authors describe how they were able to analyze the human sequence data published last year by BGI using Amazon EC2. Specifically, they have developed an alignment (bowtie) and SNP detection (SoapSNP) pipeline that is executed in parallel across a cluster using the Hadoop framework (a free software implementation of Google’s MapReduce framework). Using a 40-node, 320-core EC2 cluster, they were able to analyze 38× coverage sequence data in about three hours. The whole analysis, including data transfer and storage on Amazon S3, cost about $125. You can find a more detailed cost breakdown and comparison on Gary Stiehr’s HPCInfo post and more detail on the SNP detection on Dan Koboldt’s Mass Genomics post.

For analyzing a single genome, you really can’t beat that price. Of course, at the rate next-generation sequencing instruments are generating data, most people are not going to want to analyze just one genome. So the question becomes, what is the break even point? That is, how many genomes do you have to sequence to make buying compute resources cheaper than renting them from Amazon? We currently estimate that the fully loaded (node, chassis, rack, networking, etc.) cost of a single computational core is about $500. Thus, to purchase 320 cores would cost you about $160,000. It’s going to take a lot of genomes (1,280) to hit that break even point.

But, do you really need to analyze a genome in three hours? With the current per run throughput of a single Illumina GA IIx, it would take about four ten-day runs (40 days) to generate 38× coverage of a human genome. After each run, you could align the sequence data from that run. Each lane of data would take 8-12 core·hours to align, so a whole run’s (eight lanes’) worth of data would take about 80 core·hours. Therefore, even if you had just one core, you could align all the data before the next run completed. The consensus calling and variant detection portions of the pipeline typically take a handful of core·hours and therefore do not change the economics; they too can be completed before the first run of the next genome is completed. Thus, with a $500 investment in computational resources, you can more than keep pace with the Illumina instrument. Note that I am completely excluding the cost of storage, as that will be needed for the data and results regardless of where the computation is done.

Of course, you probably wouldn’t buy just one core. Checking over at the Dell Higher Education web site, you can get a Quad Core Precision T3500n with 4 GiB of RAM (more RAM per core than the Amazon EC2 Extra Large Instance used in the paper) and 750 GB local storage capacity (about the same storage per core as the Extra Large Instance) for $1700. You would need less than one core (25%) of that workstation’s capacity dedicated to alignment of and variant detection on data from a single Illumina GA IIx (thanks to Burrows-Wheeler Transform aligners like bowtie and bwa). Using the single core numbers, the break even point for purchase versus cloud is less than five whole genomes. Using the entire cost of the Dell workstation (even though you require less than 25% of its computational capacity), the break even point is about 14 genomes. It would take about 1.5 years (about half the expected life of IT hardware) at current throughput to sequence 14 genomes with a single Illumina GA IIx. At data rates expected in January 2010, it would take less than a year to break even.
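
The break even arithmetic above is compact enough to write down. The following Python sketch uses only the figures quoted in this post, takes 10 core·hours per lane as the midpoint of the 8-12 range, and rounds the workstation case up to whole genomes; the function and variable names are illustrative.

```python
import math

EC2_COST_PER_GENOME = 125.0   # dollars per genome on Amazon EC2 (Crossbow paper)
COST_PER_CORE = 500.0         # dollars, fully loaded cost of one cluster core
EC2_CLUSTER_CORES = 320       # cores used in the paper
WORKSTATION_COST = 1700.0     # dollars, quad-core Dell workstation
RUNS_PER_GENOME = 4           # GA IIx runs for 38x coverage
DAYS_PER_RUN = 10
LANES_PER_RUN = 8
CORE_HOURS_PER_LANE = 10      # assumed midpoint of the quoted 8-12 range


def break_even_genomes(purchase_cost: float) -> float:
    """Genomes you must analyze before buying beats renting."""
    return purchase_cost / EC2_COST_PER_GENOME


cluster_break_even = break_even_genomes(COST_PER_CORE * EC2_CLUSTER_CORES)
single_core_break_even = break_even_genomes(COST_PER_CORE)
workstation_break_even = break_even_genomes(WORKSTATION_COST)

# Can one core keep up? Compare alignment time to the length of a run.
alignment_core_hours = LANES_PER_RUN * CORE_HOURS_PER_LANE
run_wall_hours = DAYS_PER_RUN * 24
core_busy_fraction = alignment_core_hours / run_wall_hours

# Time for one GA IIx to produce enough genomes to pay off the workstation
days_per_genome = RUNS_PER_GENOME * DAYS_PER_RUN
years_to_break_even = math.ceil(workstation_break_even) * days_per_genome / 365

print(f"320-core cluster: {cluster_break_even:.0f} genomes")
print(f"single core: {single_core_break_even:.0f} genomes")
print(f"workstation: {workstation_break_even:.1f} genomes")
print(f"one core is busy {core_busy_fraction:.0%} of each run aligning")
print(f"workstation pays off in about {years_to_break_even:.1f} years")
```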

These numbers indicate that unless you are just sequencing a few genomes, you are probably better off purchasing a (possibly single node) cluster. With the proliferation of sequencing applications and publications in the last couple years, not many researchers will fall into the “few genomes” bin. Our experience has been that the more sequencing data people get, the more they want. Another way to look at this is that the entire computational hardware cost for analysis (<$1700) is less than 1% of the sequencing instrument cost; or the computational cost to analyze a whole genome (<$500) is less than 1% of the total data generation costs (reagents, flow cells, instrument depreciation, technician time, etc.). This is all not to say that there is not a place for cloud and other distributed computing frameworks in bioinformatics, but that's the topic of a future post.

What’s in an Illumina GA run directory?

October 28th, 2009 dd Posted in genomics, IT No Comments »

One of the main things that differentiates genomics from other endeavors that use a lot of disk space is that genomics file systems tend to have a lot of files (millions). This was true with Sanger sequencing, and it seems to be even more true with next-generation sequencing technologies, especially Illumina/Solexa and AB SOLiD. This large number of files and the parallel access of these files by large computational clusters tend to give most storage solutions great difficulty.

So what, exactly, is in an Illumina run directory? Well, to get breakdowns of file statistics there is a nifty little tool called fsstats. It is just a simple Perl script that crawls through a directory stat’ing files and reporting metrics. For example, when you run it on an Illumina GA IIx 2×100, high cluster density run after the primary analysis has completed, you get the following information about the distribution of file sizes. (I have rearranged and condensed the information to make it fit.)

total 7.46 TB used to store 7.46 TB user data, overhead 0.04%
  count=991227 avg=8076.50 KB
  min=0.00 KB max=13128679.30 KB
           size range    count   %tot  %tot cum       total size   %tot  %tot cum
[       0-       2 KB):   4019 ( 0.41) (  0.41)       3009.03 KB ( 0.00) (  0.00)
[       2-       4 KB):      2 ( 0.00) (  0.41)          6.99 KB ( 0.00) (  0.00)
[       4-       8 KB):    981 ( 0.10) (  0.50)       5964.82 KB ( 0.00) (  0.00)
[       8-      16 KB): 193351 (19.51) ( 20.01)    2588619.88 KB ( 0.03) (  0.03)
[      16-      32 KB):   2656 ( 0.27) ( 20.28)      58586.79 KB ( 0.00) (  0.03)
[      32-      64 KB):    901 ( 0.09) ( 20.37)      31369.79 KB ( 0.00) (  0.03)
[      64-     128 KB):   2893 ( 0.29) ( 20.66)     303872.38 KB ( 0.00) (  0.04)
[     128-     256 KB):      2 ( 0.00) ( 20.66)        345.34 KB ( 0.00) (  0.04)
[     256-     512 KB):      4 ( 0.00) ( 20.66)       1222.53 KB ( 0.00) (  0.04)
[     512-    1024 KB):      1 ( 0.00) ( 20.66)        622.26 KB ( 0.00) (  0.04)
[    1024-    2048 KB):      2 ( 0.00) ( 20.66)       3199.89 KB ( 0.00) (  0.04)
[    2048-    4096 KB):     12 ( 0.00) ( 20.66)      41779.69 KB ( 0.00) (  0.04)
[    4096-    8192 KB): 776654 (78.35) ( 99.02) 5863161178.18 KB (73.24) ( 73.28)
[   16384-   32768 KB):     21 ( 0.00) ( 99.02)     487156.46 KB ( 0.01) ( 73.28)
[   32768-   65536 KB):   3856 ( 0.39) ( 99.41)  163552521.17 KB ( 2.04) ( 75.32)
[   65536-  131072 KB):   3825 ( 0.39) ( 99.79)  307535341.32 KB ( 3.84) ( 79.17)
[  131072-  262144 KB):    133 ( 0.01) ( 99.81)   32458046.12 KB ( 0.41) ( 79.57)
[  262144-  524288 KB):   1787 ( 0.18) ( 99.99)  658830514.46 KB ( 8.23) ( 87.80)
[ 2097152- 4194304 KB):     16 ( 0.00) ( 99.99)   47898262.36 KB ( 0.60) ( 88.40)
[ 4194304- 8388608 KB):     64 ( 0.01) (100.00)  432084134.39 KB ( 5.40) ( 93.80)
[ 8388608-16777216 KB):     47 ( 0.00) (100.00)  496603147.67 KB ( 6.20) (100.00)

So the total size of the run directory is nearly 7.5 TB and there are almost one million files. The average size of a file in the run directory is about 8 MB and the maximum size is over 13 GB. The images (represented in the 4096-8192 KB range) comprise over 78% of the files and 73% of the total size of the run directory. This significant penalty can be avoided by using RTA and not transferring image files. The largest files are the alignment (ELAND) outputs and the FASTQ files in the GERALD directory. Speaking of directories, here is a breakdown by number of files in each directory.

  count=1652 avg=601.02 ents
  min=0.00 ents max=24720.00 ents
              range   count   %tot  %tot cum total ent   %tot  %tot cum
  [    0-    1 ents]:     4 ( 0.24) (  0.24)      0.00 ( 0.00) (  0.00)
  [    2-    3 ents]:     1 ( 0.06) (  0.30)      2.00 ( 0.00) (  0.00)
  [    8-   15 ents]:     3 ( 0.18) (  0.48)     26.00 ( 0.00) (  0.00)
  [   16-   31 ents]:     2 ( 0.12) (  0.61)     44.00 ( 0.00) (  0.01)
  [  128-  255 ents]:     9 ( 0.54) (  1.15)   1826.00 ( 0.18) (  0.19)
  [  256-  511 ents]:  1616 (97.82) ( 98.97) 775680.00 (78.12) ( 78.32)
  [  512- 1023 ents]:     3 ( 0.18) ( 99.15)   2920.00 ( 0.29) ( 78.61)
  [ 1024- 2047 ents]:     4 ( 0.24) ( 99.39)   7845.00 ( 0.79) ( 79.40)
  [ 2048- 4095 ents]:     2 ( 0.12) ( 99.52)   6775.00 ( 0.68) ( 80.08)
  [16384-32767 ents]:     8 ( 0.48) (100.00) 197760.00 (19.92) (100.00)

The picture for directory entries is a bit muddled since most of the directories are organized around a small multiple of the number of tiles per lane, falling in the 256-511 entries range. The directories in the 16384-32767 entries range? The image analysis (Firecrest) Temp/L00[1-8] directories, each with 24,720 entries (four clu.txt per tile (one per color) and one qcm.xml (XML!) file for each cycle for each tile in a lane).
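For the curious, here is a minimal Python sketch of the kind of crawl described above: walk a directory tree, stat each file, and count files and directories in power-of-two buckets. It is not fsstats itself (which is a Perl script) and makes no attempt to reproduce its exact output. The function names `bucket_floor`, `crawl`, and `report` are my own, and I am assuming 1 KB = 1024 bytes for the size buckets.

```python
import os
from collections import Counter


def bucket_floor(value):
    """Lower bound of the power-of-two bucket holding value (0 for values below 2)."""
    if value < 2:
        return 0
    return 1 << (int(value).bit_length() - 1)


def crawl(root):
    """Walk root, histogramming file sizes (in KB) and entries per directory."""
    size_hist = Counter()   # bucket lower bound (KB) -> number of files
    ents_hist = Counter()   # bucket lower bound (entries) -> number of directories
    total_bytes = 0
    for dirpath, dirnames, filenames in os.walk(root):
        ents_hist[bucket_floor(len(dirnames) + len(filenames))] += 1
        for name in filenames:
            try:
                st = os.lstat(os.path.join(dirpath, name))
            except OSError:
                continue  # file vanished or is unreadable; skip it
            # Assumption: 1 KB = 1024 bytes.
            size_hist[bucket_floor(st.st_size // 1024)] += 1
            total_bytes += st.st_size
    return size_hist, ents_hist, total_bytes


def report(hist, unit):
    """Print each bucket with its count, percentage, and cumulative percentage."""
    total = sum(hist.values())
    cumulative = 0
    for low in sorted(hist):
        cumulative += hist[low]
        high = max(2, low * 2)
        print(f"[{low:>8}-{high:>8} {unit}): {hist[low]:>8} "
              f"({100 * hist[low] / total:5.2f}) ({100 * cumulative / total:6.2f})")
```

Calling `sizes, ents, total = crawl(run_dir)` on a run directory and then `report(sizes, "KB")` and `report(ents, "ents")` gives histograms in the same spirit as the two tables above. On a tree with a million files, expect the crawl to take a while, since every file gets its own stat call.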


More than cells, more than bytes

October 23rd, 2009 dd Posted in genomics, IT 4 Comments »

In a recent New York Times article, Jimmy Lin of the University of Maryland is quoted as saying, “Science these days has basically turned into a data-management problem.” If this is true, then those responsible for data management have failed. The last thing scientists should be worrying about is managing data. Mining data, sure, but managing data? While the efforts documented in that story to begin to teach scientists how to grapple with large amounts of data are laudable, they all seem to focus on computer scientists, not biologists or chemists or physicists. There will be few people who can understand the worlds of, for example, biology and computer science deeply. What is needed are those who can understand one of these disciplines deeply and extend into other disciplines as needed. These individuals can act as connections, glue, between disciplines and accelerate research in these areas, which more and more require many domains of expertise. For example, designing DNA sequencing instruments requires deep understanding in fields as diverse as optics, quantum mechanics, chemistry, biology, mechanical engineering, computer science, and computer engineering. No one person can master all these fields, but people are desperately needed to bridge them. As Chad Fowler writes in the section of his book The Passionate Programmer entitled Coding Don’t Cut It Anymore,

If you want to stay relevant, you’re going to have to dive into the domain of the business you’re in.
In fact, a software person should understand a business domain not only well enough to develop software for it but also to become one of its authorities.


Next-Generation Sequencing Informatics table update

October 5th, 2009 dd Posted in genomics, IT 2 Comments »

I have made some updates to the Next-Generation Sequencing Informatics table. Specifically, I have updated the numbers for 454 Ti, including paired-end information, and added information on the Illumina GA IIx. If anyone who is not employed by AB has real-world numbers for SOLiD 3, I’d appreciate you passing them along to me (I’m looking at you, drd).

Update: I received some SOLiD 3 numbers from Nicholas Socci (thanks Nicholas!).

Update2: I received a fuller set of numbers from drd and the SOLiD 3 column is complete (thanks drd!).


Expansion

October 2nd, 2009 dd Posted in IT No Comments »

LEED Certification

The Genome Data Center has received a Gold LEED Certification from the U.S. Green Building Council. This is in addition to the Keystone Award from the St. Louis Association of General Contractors. It is quite an achievement for a power-hungry data center to receive a LEED certification, much less a Gold Certification, but the WUSM Design and Construction team along with the architects, engineers, and contractors were able to pull it off.

Recently, the final phase of construction at the Genome Data Center was completed. The initial build-out had enough power and cooling for about 40 racks of equipment. Now at full capacity, the data center is capable of supplying 4 MW of power (about the amount used by 800 homes on a hot day) and the requisite cooling to the equipment housed within it. This will support over 100 racks’ worth of high-density computational (blades) and storage equipment and its supporting infrastructure (chilled water plants, air handlers, humidity control, office space, etc.). The electrical system is completely redundant, all the way to the double-ended substation of our electrical utility. That means even if we lose one entire electrical feed, we can still operate on utility power. If we lose both electrical feeds, we have battery and fly-wheel UPS systems to carry us until the two 2 MW diesel generators start (under 10 seconds).

The building is about 1480 m² while the actual data center is about 288 m² (as they shrink computing equipment, the required electrical and cooling equipment keeps increasing in size). The data center is arranged in a standard hot aisle/cold aisle layout with cooling delivered from below through floor grates (perf plates did not provide enough airflow) via a 1.2 m raised floor.

We currently have about 3,000 cores in our computational cluster and over 3 PB (3,000,000 GB) of storage online. When full of equipment in a few years, the data center will likely house tens of thousands of cores and on the order of 100 PB of storage.

There are more pictures of the Genome Data Center on Flickr.


My secret past

September 16th, 2009 dd Posted in genomics, IT No Comments »

Now everyone will know about my secret past before I joined The Genome Center: David Dooling: Gangbusters at the Genome Center. Bio-IT World also has a nice interview with Clive Brown of Oxford Nanopore, whom I first described as the most honest guy in all of next-gen sequencing.

By the way, sorry for the extended absence; things have been crazy.
