<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>PolITiGenomics &#187; compute</title>
	<atom:link href="http://www.politigenomics.com/tag/compute/feed" rel="self" type="application/rss+xml" />
	<link>http://www.politigenomics.com</link>
	<description>Politics, Information Technology, and Genomics</description>
	<lastBuildDate>Thu, 21 Apr 2011 17:49:06 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.2.1</generator>
		<item>
		<title>Lightning strike</title>
		<link>http://www.politigenomics.com/2010/04/lightning-strike.html</link>
		<comments>http://www.politigenomics.com/2010/04/lightning-strike.html#comments</comments>
		<pubDate>Thu, 22 Apr 2010 02:25:32 +0000</pubDate>
		<dc:creator>dd</dc:creator>
				<category><![CDATA[genomics]]></category>
		<category><![CDATA[IT]]></category>
		<category><![CDATA[cloud]]></category>
		<category><![CDATA[compute]]></category>
		<category><![CDATA[informatics]]></category>
		<category><![CDATA[software]]></category>

		<guid isPermaLink="false">http://www.politigenomics.com/?p=2147</guid>
		<description><![CDATA[A previous cloud post, Puff piece, has gotten a bit of attention from Jason Stowe and Informatics Iron. While the Informatics Iron piece was positive, Mr. Stowe took issue with some of the points I made. First, he says that my claim that IT and software engineering is needed to get things running on the [...]]]></description>
			<content:encoded><![CDATA[<p>A previous cloud post, <a href="http://www.politigenomics.com/2010/02/puff-piece.html">Puff piece</a>, has gotten a bit of attention from <a href="http://blog.cyclecomputing.com/2010/02/follow-up-on-life-science-reader.html">Jason Stowe</a> and <a href="http://www.genomeweb.com/blog/zero-tolerance-policy-cloud-computing-balderdash">Informatics Iron</a>. While the Informatics Iron piece was positive, Mr. Stowe took <a href="http://www.politigenomics.com/2010/02/puff-piece.html/comment-page-1?comment-16466">issue with some of the points I made</a>. First, he says that my claim that IT and software engineering is needed to get things running on the cloud is inaccurate.<br />
<blockquote>You are implying that to get running in the cloud, an end user must worry about the “IT expertise” and “software engineering” needed to get applications up and running. I believe this is a straw-man, an incorrect assertion to begin with.</p>
<p>One of the major benefits of virtualized infrastructure and service oriented architectures is that they are repeatable and decouple the knowledge of building the service from the users consuming it. This means that one person, who creates the virtual machine images or the server code running the service, does need the expertise to get an application running properly in the cloud. But after that engineering is done once, a whole community of end-users of that service can benefit without knowledge of the specifics of getting the application to scale.</p>
<p>For example, does everyone that uses GMail/Yahoo/Hotmail know every line of software code to make it run? Do they know every operational aspect of how to make mail scale to tens of thousands of processors across many data centers?</p>
<p>Definitely not, and the point is they don’t have to. The same is true for high performance and high throughput computing. To give examples of free services that don’t require end user software engineering or IT expertise to do bioinformatics/proteomics/etc.:
<ul>
<li>The NIH Website for BLAST has, for years, been running BLAST as a service so that researchers can use GUIs to run queries on parallel back-end infrastructure (see http://www.ncbi.nlm.nih.gov/genome/seq/BlastGen/BlastGen.cgi?taxid=9606) This requires no complicated knowledge or software engineering for scientists to run BLAST as a Service.</li>
<li>Tools like ViPDAC have 2-minute tutorial videos to run proteomics on Amazon Web Service.</li>
</ul>
</blockquote>
<p> His argument is absolutely correct when dealing with established systems, applications, and work flows. For use cases like email and running BLAST, there is no need for additional software engineering or IT expertise (other than getting on the internet). In fact, The Genome Center has long offered a <a href="http://genome.wustl.edu/tools/blast">BLAST service</a> for anyone to use. Further, over the past few weeks, several prepackaged bioinformatics work flows that run on the cloud (or some approximation thereof) have been announced: Mr. Stowe&#8217;s company Cycle Computing announced <a href="http://www.cyclecomputing.com/news/28-newsitems/214-cycle-computing-launches-cyclecloud-for-life-sciences-product-family-at-xgen-sequencing">CycleCloud for Life Sciences</a>, <a href="http://www.genomequest.com/">GenomeQuest SDM</a>, <a href="http://www.cloudbiolinux.com/">Cloud Bio-Linux</a> from Bio-Team, ChIP-seq and RNA-seq analysis pipelines from <a href="https://dnanexus.com/">DNAnexus</a>, the work flows available in <a href="http://bitbucket.org/galaxy/galaxy-central/wiki/cloud">Galaxy</a>, and of course the previously published <a href="http://genomebiology.com/2009/10/11/R134">Crossbow</a>.  Unfortunately, canned analyses are not the norm in bioinformatics. Bioinformaticians love to tinker, trying to get just a little more biological information out of their data sets. The result is that bioinformatics applications and work flows are constantly being tweaked, updated, and improved. Because of this, maintenance of these pipelines is a huge burden. The supporters of these generic pipelines must work constantly to update and verify software or the users will constantly be waiting for the latest fix to be applied or latest feature to be available (anyone who installs each new version of <a href="http://www.ebi.ac.uk/~zerbino/velvet/">velvet</a> can attest to this). 
The saving grace in all of this is that as the use of sequencing becomes more widespread, the share of people doing the analysis who would classify themselves as bioinformaticians will decrease (greatly). This means that a larger and larger percentage of people with sequence data to analyze will likely not be interested in tweaking analysis pipelines but will just want to run something and get an answer. It is this ever-growing group of people that will greatly benefit from easy-to-use analysis tools, whether they be deployed on the cloud or not. Mr. Stowe and I agree that creating easy-to-use tools for non-bioinformaticians is a very worthwhile goal. Unfortunately, the proliferation of existing tool options (e.g., maq, bwa, bowtie, bfast, soap, and novoalign), now layered with a proliferation of cloud offerings, will make it even more difficult for non-experts to choose which pipeline to use. Therefore, approaches like those taken by Cycle Computing and GenomeQuest that provide default analysis pipelines <strong>and</strong> the ability for bioinformaticians to create and share their own work flows are the most likely to be successful. The development of these generic, distributed analysis frameworks that also provide useful defaults is an even more worthwhile goal because it achieves two important ends: ease of use for non-experts and the ability for bioinformaticians to tinker. Bioinformaticians are more likely to find tools like these useful and therefore will be early adopters, choose the best platforms, establish best practices on these platforms, publish results using these platforms, and <em>then</em> the non-experts will follow.</p>
<p>Mr. Stowe&#8217;s other objection related to my point that no process scales linearly with the number of cores. He concedes that point but notes:<br />
<blockquote>In fact, regardless of whether the job is linearly scalable, most companies and research institutions don’t have 1 cluster to 1 user scenarios. There are multiple users with multiple jobs each. What if you have 10 crossbow users with 10 runs to do on various genomes? Then you can get 100x performance on the *workflow as a whole*.</p></blockquote>
<p>Again, this is true, but, to be fair, that is not the same point he made in his original article. His original point was that if <em>you</em> needed <em>your</em> analysis to run faster you could just provision more nodes. I just pointed out that this is true, but <em>you</em> would likely pay a premium for that because <strong>nothing</strong> scales linearly. It may seem like a fine distinction, but with all the misinformation around clouds nowadays, it&#8217;s an important one to make. It should also be noted that without good software engineering and systems administration, even algorithms that should scale nearly linearly might not. The take-home message is that if someone has done the software engineering and systems administration work to make a program scale well and run well in a cloud environment and made it available to you, great. If not, someone is going to have to do it.</p>
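<p>One way to illustrate the premium is Amdahl&#8217;s law, which this post does not name but which is a standard model of sub-linear scaling. The sketch below is only an illustration: the 95% parallel fraction and the core counts are made-up assumptions, not measurements of any real pipeline.</p>

```python
def amdahl_speedup(parallel_fraction, cores):
    """Speedup predicted by Amdahl's law when parallel_fraction of the
    work can be spread evenly across the given number of cores."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / cores)


def relative_cost(parallel_fraction, cores):
    """Core-hours consumed relative to a single-core run (1.0 means no premium)."""
    return cores / amdahl_speedup(parallel_fraction, cores)


if __name__ == "__main__":
    # Hypothetical job: 95% of the work parallelizes (assumed, not measured).
    for cores in (1, 8, 64, 512):
        speedup = amdahl_speedup(0.95, cores)
        premium = relative_cost(0.95, cores)
        print(f"{cores:4d} cores: {speedup:6.2f}x faster, {premium:6.2f}x the core-hours")
```

<p>Under these assumptions, adding cores always shortens the run, but each additional core buys less speedup, so the core-hours you pay for grow faster than the time you save.</p>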
<p>I had the opportunity to meet Mr. Stowe at the XGen Congress and have talked more with him this week at <a href="http://www.bio-itworldexpo.com/">Bio-IT World Conference and Expo</a> (my talk is tomorrow at 11 a.m. EDT in <a href="http://www.bio-itworldexpo.com/Bio-It_Expo_Content.aspx?id=94894">Track 3: Bioinformatics and Next-Gen Data</a>). We had a good discussion about cloud computing and its role in bioinformatics (they&#8217;ve got a cool solution to the Amazon storage problem). As you can hopefully tell from this post, we are largely in agreement: engineering is needed, but once it is done, everyone benefits. <a href="http://www.cyclecomputing.com/">Cycle Computing</a> certainly has a lot of good expertise in the cloud, so if you need some engineering done, shoot him an email. Unfortunately, they probably will not be able to help you access the <a href="http://www.networkworld.com/community/node/58829">largest cloud computing service</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.politigenomics.com/2010/04/lightning-strike.html/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Next-Generation Sequencing Informatics Update</title>
		<link>http://www.politigenomics.com/2010/02/next-generation-sequencing-informatics-update.html</link>
		<comments>http://www.politigenomics.com/2010/02/next-generation-sequencing-informatics-update.html#comments</comments>
		<pubDate>Fri, 19 Feb 2010 21:55:23 +0000</pubDate>
		<dc:creator>dd</dc:creator>
				<category><![CDATA[genomics]]></category>
		<category><![CDATA[IT]]></category>
		<category><![CDATA[compute]]></category>
		<category><![CDATA[Illumina]]></category>
		<category><![CDATA[informatics]]></category>
		<category><![CDATA[storage]]></category>

		<guid isPermaLink="false">http://www.politigenomics.com/?p=2143</guid>
		<description><![CDATA[I updated the Next-Generation Sequencing Informatics table a few weeks ago but forgot to mention it on the blog. The main update was the 50G configuration of the Illumina GA IIx. Also, the Sides &#038; Associates blog linked to my table and referred to it as a &#8220;somewhat dated comparison of next-generation sequencing platforms.&#8221; Just [...]]]></description>
			<content:encoded><![CDATA[<p>I updated the <a href="http://www.politigenomics.com/next-generation-sequencing-informatics">Next-Generation Sequencing Informatics table</a> a few weeks ago but forgot to mention it on the blog. The main update was the 50G configuration of the <a href="http://www.illumina.com/systems/genome_analyzer_iix.ilmn">Illumina GA IIx</a>. Also, the Sides &#038; Associates blog linked to my table and referred to it as a &#8220;<a href="http://sidesandassociates.com/blog/2010/01/01/the-business-of-sequencing/">somewhat dated comparison of next-generation sequencing platforms</a>.&#8221; Just to clarify, this table represents <em>average</em> throughput for <em>production</em> systems, not vendor claims about throughput and not future vaporware (and Alejandro Gutierrez corrected his description in the post once I pointed this out). As new systems come online and further improvements are made to existing platforms, the table will be updated.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.politigenomics.com/2010/02/next-generation-sequencing-informatics-update.html/feed</wfw:commentRss>
		<slash:comments>6</slash:comments>
		</item>
		<item>
		<title>Cloudy with a chance of sunshine</title>
		<link>http://www.politigenomics.com/2010/01/cloudy-with-a-chance-of-sunshine.html</link>
		<comments>http://www.politigenomics.com/2010/01/cloudy-with-a-chance-of-sunshine.html#comments</comments>
		<pubDate>Mon, 25 Jan 2010 12:20:42 +0000</pubDate>
		<dc:creator>dd</dc:creator>
				<category><![CDATA[IT]]></category>
		<category><![CDATA[cloud]]></category>
		<category><![CDATA[compute]]></category>

		<guid isPermaLink="false">http://www.politigenomics.com/?p=1906</guid>
		<description><![CDATA[As stated in previous posts (Bioinformatics and cloud computing and Head in the clouds), I don&#8217;t think that cloud computing wins the cost competition with local resources. However, there are several reasons why an organization should consider cloud computing. Several of the reasons I present below are discussed in a great interview with Russ Daniels [...]]]></description>
			<content:encoded><![CDATA[<p>As stated in previous posts (<a href="http://www.politigenomics.com/2009/11/bioinformatics-and-cloud-computing.html">Bioinformatics and cloud computing</a> and <a href="http://www.politigenomics.com/2010/01/head-in-the-clouds.html">Head in the clouds</a>), I don&#8217;t think that cloud computing wins the cost competition with local resources. However, there are several reasons why an organization should consider cloud computing. Several of the reasons I present below are discussed in a great interview with Russ Daniels of HP at ars technica, <a href="http://arstechnica.com/business/news/2008/12/hp-cloud-computing-interview.ars/1">Into the cloud: a conversation with Russ Daniels, Part I</a> and <a href="http://arstechnica.com/business/news/2009/02/into-the-cloud-a-conversation-with-russ-daniels-part-ii.ars/1">Part II</a>. If you are at all curious about cloud computing, it is well worth reading. (You may also be interested in the <a href="http://dsl.cs.uchicago.edu/ScienceCloud2010/index.html">ScienceCloud 2010 Workshop</a>.)</p>
<h3>Peaks and valleys</h3>
<p>The ability to dynamically provision computing resources is integral to the concept of clouds. Dynamic provisioning is often used by online retailers to account for variability in consumer buying. A retailer may maintain 20 servers year round to handle average purchasing but dynamically add servers in the cloud to account for peaks in purchasing, e.g., around the Christmas holiday. In bioinformatics, there are often computational crunches before papers get submitted, before meetings, or when a mistake in an algorithm is found and a large number of calculations need to be redone (<a href="http://pages.cs.wisc.edu/~miron/">Miron Livny</a> of Condor and Open Science Grid calls these &#8220;oopses&#8221;). Another type of dynamic provisioning involves varying levels of certain hardware architectures or operating systems as needed by current computational demand. For example, one application may require x86 and Ubuntu 8.04 LTS while another may require amd64/em64t/x86_64 and Ubuntu 9.10. If the utilization of each of these programs is cyclical, you can provision the exact system you want when it is needed. This can be done using something like Amazon EC2 <em>or</em> an internal cloud. Thus, dynamic provisioning allows IT departments to design their solutions for steady-state operations but still meet computational needs during peaks.</p>
<h3>Space, the final frontier</h3>
<p>At universities all over the world there is a constant battle for space. Researchers are always seeking more and administrators are always miserly about allocating it. If your computing needs expand beyond your ability to house, power, and cool the hardware, cloud computing offers a solution. While it may not be cheaper than if the space, power, and cooling were available and paid for out of your grant overhead, it will almost certainly be cheaper than buying your own land and building your own data center. Of course, what people traditionally think of as cloud computing, e.g., Amazon EC2, is not the only option here. There are colocation facilities and scientific computing resources, e.g., <a href="http://www.ncsa.illinois.edu/">NCSA</a> and <a href="http://www.opensciencegrid.org/">Open Science Grid</a>. The latter are normally acquired through a granting process.</p>
<h3>Persistence pays off</h3>
<p>Cloud computing is also very attractive because of its persistence. If I have my computing and storage in the cloud, I can access it from anywhere. When the power goes out at my office, I can use my phone to access the data. When my computer crashes, the computation is still running on the cloud. When my disk fails, my data is still in the cloud. Of course, the cloud does fail at times too. Amazon promises 99.9% uptime, or nearly 9 hours of downtime per year. Of course, if the cloud resources are pulling data from your site (something that may take more time than the computation with current solutions), when your systems go down, you&#8217;re still out of luck.</p>
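<p>The downtime figure follows directly from the uptime percentage. A minimal sketch of the arithmetic, assuming a non-leap year of 8,760 hours (the function name is just for illustration):</p>

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours in a non-leap year


def downtime_hours_per_year(uptime_percent):
    """Hours per year a service can be down while still meeting its uptime figure."""
    return HOURS_PER_YEAR * (100.0 - uptime_percent) / 100.0


if __name__ == "__main__":
    print(f"{downtime_hours_per_year(99.9):.2f} hours of downtime per year")
```

<p>For 99.9% uptime this works out to a little under nine hours per year.</p>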
]]></content:encoded>
			<wfw:commentRss>http://www.politigenomics.com/2010/01/cloudy-with-a-chance-of-sunshine.html/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Airline security</title>
		<link>http://www.politigenomics.com/2010/01/airline-security.html</link>
		<comments>http://www.politigenomics.com/2010/01/airline-security.html#comments</comments>
		<pubDate>Wed, 13 Jan 2010 20:59:56 +0000</pubDate>
		<dc:creator>dd</dc:creator>
				<category><![CDATA[IT]]></category>
		<category><![CDATA[cloud]]></category>
		<category><![CDATA[compute]]></category>
		<category><![CDATA[informatics]]></category>

		<guid isPermaLink="false">http://www.politigenomics.com/?p=1820</guid>
		<description><![CDATA[Despite the fact that I was traveling when I wrote this, this post is not about air travel, but it is about security. One topic that continually comes up when the subject of cloud computing is discussed is security. A recent article in MIT Technology Review, Security in the Ether, discusses the issues. CNN tries [...]]]></description>
			<content:encoded><![CDATA[<p>Despite the fact that I was traveling when I wrote this, this post is not about air travel, but it is about security. One topic that continually comes up when the subject of cloud computing is discussed is security. A recent article in MIT Technology Review, <a href="http://www.technologyreview.com/web/24166/">Security in the Ether</a>, discusses the issues. CNN tries to scare you with a title like <a href="http://www.cnn.com/2009/TECH/11/04/cloud.computing.hunt/index.html">A trip into the secret, online &#8216;cloud&#8217;</a>. Spooky stuff. It&#8217;s not a cloud, it&#8217;s a &#8216;cloud&#8217;. And it&#8217;s secret. (Secret? Really? There are a lot of words that come to mind when I think of compute clouds, but secret is not one of them. Just about every talk at OSCON last year mentioned the cloud.) Now the <a href="http://arstechnica.com/tech-policy/news/2010/01/ftc-reminds-us-that-storing-data-in-the-cloud-has-drawbacks.ars">FTC wants the FCC to warn consumers</a> that storing personal data &#8220;in the cloud&#8221; makes it easier for &#8220;hackers&#8221; to access it (and by hackers I mean federal law enforcement officials). While I agree that consumers should be careful about the type of information they share and store online (an admonition that is likely lost on the <a href="http://www.eweekeurope.co.uk/news/facebook-s-zuckerberg-questions-privacy-expectations-2983">Facebook generation</a>) and think about <a href="http://www.guardian.co.uk/technology/2009/sep/02/cory-doctorow-cloud-computing">the larger issues around the cloud</a> like ownership and control, personal information is not really a more significant issue in bioinformatics cloud computing than in bioinformatics local computing (other than the issue of the credit card number you use to pay for the service). 
Sure, if you are sequencing human genomes you need to transfer the data to and from the cloud securely, but for most projects we have to submit the data to central repositories anyway. So transferring data in a secure way, whether it be to clouds or NCBI, is a largely solved problem (data transfer rates notwithstanding). &#8220;How can we secure our data in the cloud?&#8221; is the common question that arises in cloud computing. While the consideration of security in the context of cloud computing is laudable, it is likely (and unfortunate) that the same people raising the specter of security in the cloud don&#8217;t think as much about security on their own systems. In a <a href="http://www.politigenomics.com/2010/01/head-in-the-clouds.html">recent post</a> I mentioned how dead simple it was to perform security updates on an Ubuntu system. Unfortunately, despite it being simple, it often doesn&#8217;t get done. However, what is more insidious is a different kind of cloud security: wireless networks. A wireless network provides anyone with a <a href="http://www.oreillynet.com/cs/weblog/view/wlg/448">Pringles can</a> &#8220;physical&#8221; access to your network, yet often only minimal security, if any, is used on these networks. Add to that the often lax physical security around company and university networks and I have to say I don&#8217;t really see data security as a major concern for me when it comes to cloud computing. That is not to say it is not a concern, rather that it does not concern me in the cloud much more than it does on my own network.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.politigenomics.com/2010/01/airline-security.html/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>HiSeq 2000</title>
		<link>http://www.politigenomics.com/2010/01/hiseq-2000.html</link>
		<comments>http://www.politigenomics.com/2010/01/hiseq-2000.html#comments</comments>
		<pubDate>Wed, 13 Jan 2010 00:48:53 +0000</pubDate>
		<dc:creator>dd</dc:creator>
				<category><![CDATA[genomics]]></category>
		<category><![CDATA[IT]]></category>
		<category><![CDATA[cloud]]></category>
		<category><![CDATA[compute]]></category>
		<category><![CDATA[Illumina]]></category>
		<category><![CDATA[informatics]]></category>

		<guid isPermaLink="false">http://www.politigenomics.com/?p=1914</guid>
		<description><![CDATA[Today Illumina announced their new, high-throughput sequencing instrument, the HiSeq 2000. Sure, the name isn&#8217;t that great, but the capabilities, if not game changing, are a significant step forward: the HiSeq 2000 can sequence a tumor/normal pair to 30&#215; coverage in one 8-day run. How does it achieve this five-fold improvement in throughput over current [...]]]></description>
			<content:encoded><![CDATA[<p><a href="http://www.illumina.com/systems/hiseq_2000.ilmn"><img alt="" src="http://www.illumina.com/images/systems/hiseq_2000.jpg" title="HiSeq 2000" class="alignright" width="265" height="290" /></a></p>
<p>Today Illumina announced their new, high-throughput sequencing instrument, the <a href="http://www.illumina.com/systems/hiseq_2000.ilmn">HiSeq 2000</a>. Sure, the name isn&#8217;t that great, but the capabilities, if not game changing, are a significant step forward: the HiSeq 2000 can sequence a tumor/normal pair to 30&times; coverage in one 8-day run. How does it achieve this five-fold improvement in throughput over current second-generation sequencing technologies? What it doesn&#8217;t do is change the fundamentals of the Illumina sequencing technology. The HiSeq 2000 uses <a href="http://www.illumina.com/technology/sequencing_technology.ilmn">Sequencing By Synthesis (SBS)</a>, just like the Genome Analyzer (GA). In fact, it actually dials down the current SBS state of the art, using lower cluster densities (350,000 &#8211; 400,000 clusters/mm<sup>2</sup>) and read lengths (2&times;100) than the latest GA IIx release (600,000 clusters/mm<sup>2</sup> and 2&times;125). (Current tiles are 0.5293 mm<sup>2</sup>, so 600,000 clusters/mm<sup>2</sup> equate to about 318,000 clusters/tile.) The throughput improvement comes from two major factors: increased data collection <em>area</em> and <em>rate</em>. The HiSeq 2000 has two 8-lane flow cells, as compared to the single flow cell on the GA, and images both the top and bottom surfaces of the flow cell. In addition, the imaging area of the HiSeq 2000 flow cell is larger than the GA flow cell&#8217;s. This all adds up to a more than five-fold increase in surface area to collect data from on the HiSeq 2000. As you know if you operate a GA, the imaging part of each cycle takes up more time than the chemistry portion. 
Thus, to run two flow cells on the same instrument, Illumina needed to speed up data acquisition so that it was at least as fast as the chemistry stage so that one flow cell could be doing chemistry while the other was imaging (like the <a href="http://www3.appliedbiosystems.com/AB_Home/applicationstechnologies/SOLiD-System-Sequencing-C/index.htm">SOLiD</a> platform from Life Technologies). To do this, they used their experience with systems like iScan and its <a href="http://en.wikipedia.org/wiki/Time_Delay_and_Integration">Time Delay and Integration (TDI)</a> line imaging technology, and completely replaced the entire optics system. The GA performs area imaging to collect its image data. The flow cell is moved, the camera focuses, and four images (tiles) are taken (one for each base). The flow cell is then moved again and the process repeated. For the current GA IIx, each of the eight lanes is imaged at 120 positions (in a 2&times;60 grid) resulting in 480 images per lane per cycle. The HiSeq 2000 scans a 2048 pixel wide swath down one side of a lane and then comes back and scans the swath on the other side of the lane. This is then repeated for the other surface in the lane and then across all the lanes. Because of this continuous data collection, there are four cameras in the system rather than one. This line scanning system is able to collect data at a rate of 50 MB/s, as compared to about 8 MB/s in the GA IIx. When you put all of this together, the HiSeq 2000 is able to generate about 200 Gb of sequence from over 1 billion clusters in the form of 2&times;100 base reads from two flow cells in about eight days with error rates (1-2%) comparable to current GA IIx data (as one would expect since both use SBS). Illumina actually already has data from &#8220;production&#8221; instruments on several human genomes.</p>
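<p>The figures quoted above can be cross-checked with a few lines of arithmetic. This is only a sketch using the numbers given in this post; the variable names are mine.</p>

```python
# All inputs are figures quoted in the post; variable names are illustrative.
TILE_AREA_MM2 = 0.5293         # area of a current GA IIx tile
GA_CLUSTER_DENSITY = 600_000   # clusters per mm^2 on the latest GA IIx release
GA_POSITIONS_PER_LANE = 120    # 2 x 60 grid of imaging positions per lane
IMAGES_PER_POSITION = 4        # one image per base
HISEQ_RUN_GB = 200             # gigabases per HiSeq 2000 run
HISEQ_RUN_DAYS = 8
HISEQ_SCAN_MB_PER_S = 50
GA_SCAN_MB_PER_S = 8

clusters_per_tile = TILE_AREA_MM2 * GA_CLUSTER_DENSITY
images_per_lane_per_cycle = GA_POSITIONS_PER_LANE * IMAGES_PER_POSITION
hiseq_gb_per_day = HISEQ_RUN_GB / HISEQ_RUN_DAYS
scan_rate_ratio = HISEQ_SCAN_MB_PER_S / GA_SCAN_MB_PER_S

if __name__ == "__main__":
    print(f"clusters per tile:         {clusters_per_tile:,.0f}")
    print(f"images per lane per cycle: {images_per_lane_per_cycle}")
    print(f"HiSeq 2000 Gb per day:     {hiseq_gb_per_day:.0f}")
    print(f"scan rate, HiSeq vs GA:    {scan_rate_ratio:.2f}x")
```

<p>The results agree with the roughly 318,000 clusters per tile and 480 images per lane per cycle quoted above, and give 25 Gb per day for the HiSeq 2000.</p>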
<p>Because of the five-fold increase in sequence data generation rate (25 Gb/day versus 5 Gb/day for the GA IIx), Illumina needed to rethink how it processed and stored all the data. Normal hard drives cannot write four 625 MB images every 30 seconds. As such, images are not written to disk by default; they are processed in memory by the instrument control software (as opposed to the GA, where images are written to the disk and processed by RTA, which also does the base calling). You can save images if you want, but you will need 32 TB of disk space per run and it will slow down your run. As with the most recent version of RTA for the GA IIx, you can save thumbnail images (without penalty) to aid in troubleshooting (the thumbnails, of course, cannot be used for off-instrument analysis). Because of the need to incorporate phasing and pre-phasing information when base calling, the RTA for HiSeq lags a few cycles behind the current data acquisition cycle. The result is that base calling does not actually complete until about two hours after the run completes. In other words, the processing of data is not real time, but it is synchronous. In fact, if the data analysis falls behind, the instrument is paused in a safe state until it catches up. This is guaranteed to occur at least once in each run: after around five cycles the instrument will pause for about two hours while template generation (cluster identification) is performed. The large data rates also forced Illumina to rethink how they store and transfer data off the instrument. Gone are the QSEQ files; they are replaced by BCL files, which are binary, per-image, per-cycle files that contain the base call and quality information. Because they are per-image, per-cycle files, they can be transferred cycle by cycle as they are generated (as opposed to QSEQ files, which are read based). The BCL files are also more compact, requiring only 1 byte/base (B/b) as compared to QSEQ files, which require about 2.5 B/b. 
In addition, the intensity files are also not transferred by default, so RTA output goes from 10 B/b to just 1 B/b. Thus, even though you are generating five times more sequence data than a GA, your RTA directory will actually be smaller (about 250 GB).</p>
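<p>To see what the bytes-per-base figures mean for a full run, here is a rough sketch that applies them to the 200 Gb run yield quoted earlier. It assumes decimal units (1 GB = 10<sup>9</sup> bytes), and the dictionary labels are mine.</p>

```python
RUN_YIELD_BASES = 200e9  # 200 Gb per HiSeq 2000 run, as quoted in the post

# Bytes per base (B/b) figures quoted in the post; the labels are illustrative.
BYTES_PER_BASE = {
    "BCL files": 1.0,
    "QSEQ files": 2.5,
    "full GA-style RTA output": 10.0,
}


def size_gb(bases, bytes_per_base):
    """Storage in decimal gigabytes for a given yield and bytes-per-base figure."""
    return bases * bytes_per_base / 1e9


if __name__ == "__main__":
    for label, bytes_per_base in BYTES_PER_BASE.items():
        print(f"{label:26s} {size_gb(RUN_YIELD_BASES, bytes_per_base):8,.0f} GB")
```

<p>At 1 B/b the base calls alone come to about 200 GB, which is consistent with an RTA directory of about 250 GB once the remaining run files are included (that breakdown is my assumption, not something Illumina stated).</p>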
<p>The HiSeq 2000 has a completely new instrument software user interface. The instrument user interface allows the operator to input data via a keyboard and mouse or a touch screen. Run configuration and setup are done via a wizard-driven work flow. The setup and running of each flow cell is completely independent. This allows you to start the runs at different times, have a different number of cycles for each flow cell, and even do an indexing run on one flow cell and a standard paired-end run on the other. The cycles of each flow cell will need to synchronize so that one is doing chemistry and the other data acquisition. The current version of the instrument control software has no LIMS integration capabilities. Since this instrument is clearly targeting large genome centers, that is unfortunate.</p>
<p>The instrument software also has greatly enhanced real-time metric reporting as compared to the GA. In addition to the RTA reports, e.g., cluster density, intensity, focus, and quality scores, the standard reports typically generated after a GA run by GERALD, e.g., the Summary report, are generated cycle by cycle by RTA and made available to the operator via the instrument control software and remotely as HTML pages (there is also discussion of a smart phone application). <a href="http://en.wikipedia.org/wiki/Phi_X_174">Phi X</a> can be spiked into lanes to allow the software to generate error rate numbers (and Error and Perfect plots) on the fly as well. All in all, the reports are very similar to those people have become familiar with using the GA; they are just generated dynamically during the run. This will allow operators to more carefully observe their runs and take corrective action if something goes awry. All of the extra data processing and reports do not come without the requirement of additional computational horsepower. Don&#8217;t worry though, no iPAR is necessary. The HiSeq instrument computer is just beefier than its GA counterpart: two quad-core 64-bit processors, 48 GiB of RAM, and a 64-bit Microsoft Windows Vista operating system. For downstream analysis, Illumina will still offer their IlluminaCompute (turn-key sequence data analysis cluster) but also is strongly pushing cloud-based analysis solutions (specifically Amazon AWS). Illumina has altered GERALD so ELANDv2 can run using more than one process per lane. Alignment of 200 Gb of data using ELANDv2 takes about 30 hours using 64 cores.</p>
<p>The good and the bad of this instrument is that it is really just more of the same.  Illumina has taken the optics from iScan and combined that with the fluidics and chemistry of the GA. This means the system is more likely to &#8220;work&#8221; at launch than those of us dealing with new sequencing platforms are used to. It also means the data will be familiar (just more of it) and therefore will suffer from the same limitations (increasing errors with read length, short insert sizes). Shrinking from the bleeding edge of the GA in terms of cluster density and read length means the HiSeq likely has significant head room to increase well beyond 200 Gb/run. A quick back of the envelope calculation pushing the HiSeq to 600,000 clusters/mm<sup>2</sup> and 2&times;150 read lengths results in 450 Gb/run. (<em>Again, that is my rough calculation and not any sort of promise from Illumina.</em>) So, while it may be more of the same, it is likely that it will be a <strong>lot</strong> more of the same. The ability to sequence a tumor and normal genome from an individual in a single instrument run in about a week is really going to change the calculation (and economics) for cancer sequencing going forward.</p>
<p><strong>Update:</strong> The above text has been corrected to state that QSEQ files are about 2.5 B/b. It is the entire RTA output that is 10 B/b.</p>
<p><strong>Update2:</strong> I&#8217;ve added some links.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.politigenomics.com/2010/01/hiseq-2000.html/feed</wfw:commentRss>
		<slash:comments>9</slash:comments>
		</item>
		<item>
		<title>Head in the clouds</title>
		<link>http://www.politigenomics.com/2010/01/head-in-the-clouds.html</link>
		<comments>http://www.politigenomics.com/2010/01/head-in-the-clouds.html#comments</comments>
		<pubDate>Mon, 11 Jan 2010 01:16:20 +0000</pubDate>
		<dc:creator>dd</dc:creator>
				<category><![CDATA[genomics]]></category>
		<category><![CDATA[IT]]></category>
		<category><![CDATA[cloud]]></category>
		<category><![CDATA[compute]]></category>
		<category><![CDATA[informatics]]></category>

		<guid isPermaLink="false">http://www.politigenomics.com/?p=1819</guid>
		<description><![CDATA[It seems that due to my recent post, Bioinformatics and cloud computing, I have been labeled a cloud skeptic. While I don&#8217;t reject that label outright, I won&#8217;t accept it either. If I may label myself, I would call myself a cloud realist. My first piece of evidence is that at the end of my [...]]]></description>
			<content:encoded><![CDATA[<p>It seems that due to my recent post, <a href="http://www.politigenomics.com/2009/11/bioinformatics-and-cloud-computing.html">Bioinformatics and cloud computing</a>, I have been labeled a cloud skeptic. While I don&#8217;t reject that label outright, I won&#8217;t accept it either. If I may label myself, I would call myself a cloud realist. My first piece of evidence is that at the end of my previous post I specifically state, &#8220;This is all not to say that there is not a place for cloud and other distributed computing frameworks in bioinformatics, but that&#8217;s the topic of a future post.&#8221; Unfortunately, this is not the future post to which that statement refers. The purpose of this post is to respond to some of the comments made on that post and around the web.</p>
<p>First, Ben Langmead said,<br />
<blockquote>My main comment is that you’re comparing the cloud cost against only at one type of cost: the one-time cost of buying new machines and adding them to your (already large; at least at Wash U) pool of computers. That isn’t the only relevant number for a lot of people, especially those in smaller institutions and academic departments, because (a) there are recurring costs for electricity, cooling, space, and (b) there isn’t necessarily a huge pool of computers (and support staff, and space) to begin with, so the initial cost and effort barrier can be much larger than the cost of the machines per se.</p></blockquote>
<p> <a href="http://lingpipe-blog.com/">Bob Carpenter</a> then adds similar comments,<br />
<blockquote>To repeat what Ben Langmead said above, the total cost of ownership of a computer, even for a university, is much higher than its purchase cost. For instance, how many computers does each sysadmin manage (or how much time does it take to manage new operating system patches, software installs, etc.)? How much space do they take up? The power for these beasts is not inconsiderable&hellip; My wife’s having trouble with her cluster at NYU because the building’s heating and cooling are both tied to the same faulty plumbing system; so even though it’s winter here in NYC, when the heat went out, so did the machine room cooling, so they had to shut down all the machines for a day or two. Just like when the AC went out in the summer.</p></blockquote>
<p> Finally, <a href="http://www.warelab.org/blog/?p=307">Shiran Pasternak</a> over at <a href="http://www.warelab.org/blog/">Plant Tech Tonics</a> says<br />
<blockquote>What his numbers don&#8217;t take into account is the overhead of running a (possibly single node) cluster. While the fixed cost of purchasing computer equipment might be manageable, especially compared to chemical reagents, the operational costs of running a data center are substantial. Computer equipment needs to be continually serviced, be it for software, security, or kernel patches, or for unscheduled maintenance. In addition, energy costs for running a data center are high and expected to increase in the near future.</p></blockquote>
<p> Yes, it is true that the cost for the Dell server I quoted was just the purchase price. But the price I quoted for a computing core in our cluster, $500, was a <strong>fully loaded</strong> cost. As indicated in the post, that fully loaded cost includes server, rack, networking, electrical hookup, installation, 3-year warranty, etc. In other words, that is the cost to add a core to an existing cluster and was provided for those researchers that do have clusters (as opposed to the cost of the Dell which was provided for those who do not). It does not include system administration, electrical power, or cooling. In other words, it does not include ongoing costs, only capital costs. Why did I not include those ongoing costs? Because I did not need to. To maintain pace with the sequence data generated by an Illumina GA IIx or two, you don&#8217;t need any of that stuff! For electrical power and cooling, the addition of a few cores to an existing computing infrastructure is not going to make a substantive difference in power or cooling. For a lab without an existing computing cluster, all you need is the desk where you sit your bioinformatician. If you are at a normally operating university, the electrical power and cooling to office space is provided from the overhead your university takes out of your grants. If you operate a core facility at a university, then you simply work these costs into the fees you charge (their contributions are several orders of magnitude less than the sequencing reagents). What about labs who have lots of sequencers but not a lot of computing power? Well, that&#8217;s bad planning and allocation of assets; no one can help you.</p>
<p>Systems administration costs are a similar story. For researchers with existing clusters, the addition of a few cores to keep pace with a few Illumina instruments will not require them to hire additional IT staff. For researchers without a cluster, I posit that it does not take more system administration costs to manage a single desktop workstation than it would to manage a cluster of Amazon EC2 nodes. Amazon EC2 provides <a href="http://aws.amazon.com/ec2/#instance">virtual hardware</a> and a stock installation of an <a href="http://aws.amazon.com/ec2/#os">operating system</a>. Aside from the fact that you can purchase computers from Dell with Red Hat Enterprise [GNU/]Linux, any bioinformatician worth her salt (or any 12-year-old for that matter) can install Ubuntu on a computer. Just as the Dell customer will have to install their bioinformatics tools on the systems, so too will the Amazon EC2 customer; except they will need to install them on <em>all</em> the nodes they have rented. Regarding maintaining security patches and other updates, that is also dead simple in Ubuntu (although I will readily admit that just because something is easy, it does not necessarily follow that people will do it). The bottom line is that maintaining a workstation used for day-to-day activities and analyzing data from one or two Illumina instruments is more likely to be within the capabilities of a bioinformatician than setting up and maintaining an Amazon EC2 cluster.</p>
<p>Another point brought up in the above comments was reliability of the systems. One of the arguments in this area is that with your own hardware, you are responsible for maintaining the equipment while with Amazon EC2, they manage all the hardware. This is not really the case, though. All of the costs I have quoted included a 3-year warranty with on-site service. The reliability argument also involves downtime. If your local systems go down, whether for hardware failures, network outages, power outages, or Armageddon, it is true that you will not be able to do any computations on them, but you&#8217;re also not going to be able to access your EC2 systems and those EC2 systems will not be able to pull data from your systems (and in the case of Armageddon, Amazon EC2 will probably also be down).</p>
<p>So, that leaves us with the question, what would the fully loaded cost of the Dell workstation be, and what is the break even point with Amazon EC2? The cost of the quad-core system was roughly $1700. You only need one core for data analysis. Since you need to buy your bioinformatician a workstation anyway and it needs an operating system, bioinformatics software, power, and cooling, we&#8217;ll ignore those costs. So the purchase price becomes the <strong>fully loaded</strong> cost for comparison purposes. Assuming you would buy your bioinformatician a dual-core system with 1 GiB of RAM (Firefox uses a lot of memory), which costs about $1000, the incremental cost of getting a machine capable of analyzing data is $700; the incremental cost per computing core is only $350. That dollar amount will buy you less than three genomes&#8217; worth of analysis on Amazon EC2.</p>
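<p>For anyone who wants to check the arithmetic, here is a minimal sketch. The prices are the rough figures quoted above, the $125 per genome is the Crossbow analysis cost from the previous post, and the variable names are just illustrative.</p>

```python
# Rough check of the incremental-cost argument. All prices are the
# approximate figures quoted in the post, not current vendor quotes.
QUAD_CORE_PRICE = 1700      # quad-core workstation, USD
DUAL_CORE_PRICE = 1000      # dual-core desktop you would buy anyway, USD
EC2_COST_PER_GENOME = 125   # Crossbow analysis cost from the previous post, USD

# Extra money spent to get an analysis-capable machine.
incremental_cost = QUAD_CORE_PRICE - DUAL_CORE_PRICE

# The quad-core box adds two cores beyond the desktop's two.
extra_cores = 4 - 2
cost_per_extra_core = incremental_cost / extra_cores

# How many EC2 genome analyses the per-core amount would rent.
genomes_on_ec2 = cost_per_extra_core / EC2_COST_PER_GENOME

print(f"incremental cost: ${incremental_cost}")
print(f"cost per extra core: ${cost_per_extra_core:.0f}")
print(f"EC2 genomes for that amount: {genomes_on_ec2:.1f}")
```

<p>The last line prints 2.8, which is where the &#8220;less than three genomes&#8221; figure comes from.</p>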
<p>Bob Carpenter had a few other points worth addressing: viruses and running analysis multiple times. I would argue that the former is an issue regardless of where you run your analysis. Plus, for the GNU/Linux systems we are talking about in these scenarios, viruses are much less of an issue than they are for Microsoft Windows. Regarding running analysis multiple times, sure it would mean you may need more than one core to keep up, but it also means you are going to pay Amazon a lot more too. With the quad-core system quoted above, you have a whole extra core (two for the desktop, one for the single pass analysis, and one extra) to spill over into at no cost.</p>
<p>Before I close, I would like to thank all the commenters for raising the above points. All of the issues they raised are very important to consider when jumping into the next-generation informatics space. They also made it clear that my previous post was not as thorough as I thought it was when I hit the publish button. In addition to the excellent comments I quoted above, there were also several other good points regarding software in the comments of the previous post that I hope to incorporate in future posts (and hopefully this post will generate a few comments as well).</p>
]]></content:encoded>
			<wfw:commentRss>http://www.politigenomics.com/2010/01/head-in-the-clouds.html/feed</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>Bioinformatics and cloud computing</title>
		<link>http://www.politigenomics.com/2009/11/bioinformatics-and-cloud-computing.html</link>
		<comments>http://www.politigenomics.com/2009/11/bioinformatics-and-cloud-computing.html#comments</comments>
		<pubDate>Tue, 24 Nov 2009 19:54:22 +0000</pubDate>
		<dc:creator>dd</dc:creator>
				<category><![CDATA[genomics]]></category>
		<category><![CDATA[IT]]></category>
		<category><![CDATA[cloud]]></category>
		<category><![CDATA[compute]]></category>
		<category><![CDATA[FLOSS]]></category>
		<category><![CDATA[Illumina]]></category>
		<category><![CDATA[informatics]]></category>
		<category><![CDATA[software]]></category>

		<guid isPermaLink="false">http://www.politigenomics.com/?p=1728</guid>
		<description><![CDATA[From the Using clouds for parallel computations in systems biology workshop at the recent SC09 conference (Informatics Iron writeup) to last month&#8217;s Genome Informatics meeting, everyone in bioinformatics is talking about cloud computing these days. Last week Steven Salzberg&#8216;s group published a paper on their Crossbow tool entitled Searching for SNPs with cloud computing (Cloudera [...]]]></description>
			<content:encoded><![CDATA[<p>From the <a href="http://www.mcs.anl.gov/events/workshops/sc09-sysbio/index.php">Using clouds for parallel computations in systems biology</a> workshop at the recent <a href="http://sc09.supercomputing.org/">SC09 conference</a> (<a href="http://www.genomeweb.com/blog/cloud-bio-computing-sc09">Informatics Iron writeup</a>) to last month&#8217;s <a href="http://www.genomeweb.com/informatics/genome-informatics-speakers-say-second-gen-sequencing-makes-giddy-times-bioinfor">Genome Informatics meeting</a>, everyone in bioinformatics is talking about cloud computing these days. Last week <a href="http://genome.fieldofscience.com/">Steven Salzberg</a>&#8216;s <a href="http://www.cbcb.umd.edu/~salzberg/">group</a> published a paper on their Crossbow tool entitled <a href="http://genomebiology.com/2009/10/11/R134">Searching for SNPs with cloud computing</a> (<a href="http://www.cloudera.com/blog/2009/10/15/analyzing-human-genomes-with-hadoop/">Cloudera blog post on Crossbow</a>). In the paper the authors describe how they were able to analyze the human sequence data <a href="http://www.nature.com/nature/journal/v456/n7218/abs/nature07484.html">published last year by BGI</a> using <a href="http://aws.amazon.com/ec2/">Amazon EC2</a>.  Specifically, they have developed an alignment (<a href="http://bowtie-bio.sourceforge.net/index.shtml">bowtie</a>) and SNP detection (<a href="http://soap.genomics.org.cn/soapsnp.html">SoapSNP</a>) pipeline that is executed in parallel across a cluster using the <a href="http://hadoop.apache.org/">Hadoop</a> framework (a <a href="http://fsf.org/">free software</a> implementation of <a href="http://labs.google.com/papers/mapreduce.html">Google&#8217;s MapReduce</a> framework).  Using a 40-node, 320-core EC2 cluster, they were able to analyze 38&times; coverage sequence data in about three hours. 
The whole analysis, including data transfer and storage on <a href="http://aws.amazon.com/s3/">Amazon S3</a>, cost about $125. You can find a more detailed cost breakdown and comparison on Gary Stiehr&#8217;s <a href="http://hpcinfo.com/2009/11/22/benchmarking-the-cloud-for-genomics/">HPCInfo post</a> and more detail on the SNP detection on Dan Koboldt&#8217;s <a href="http://www.massgenomics.org/2009/11/crossbow-ngs-informatics-in-the-cloud.html">Mass Genomics post</a>.</p>
<p>For analyzing a single genome, you really can&#8217;t beat that price. Of course, at the rate next-generation sequencing instruments are generating data, most people are not going to want to analyze just one genome. So the question becomes, what is the break even point? That is, how many genomes do you have to sequence to make buying compute resources cheaper than renting them from Amazon? We currently estimate that the fully loaded (node, chassis, rack, networking, etc.) cost of a single computational core is about $500. Thus, to purchase 320 cores would cost you about $160,000. It&#8217;s going to take a lot of genomes (1280) to hit that break even point. But, do you really need to analyze a genome in three hours? With the current per run throughput of a single Illumina GA IIx, it would take about four ten-day runs (40 days) to generate 38&times; coverage of a human genome. After each run, you could align the sequence data from that run. Each lane of data would take 8-12 core&middot;hours to align, so a whole run&#8217;s (eight lanes&#8217;) worth of data would take about 80 core&middot;hours. Therefore, even if you had just one core, you could align all the data before the next run completed. The consensus calling and variant detection portions of the pipeline typically take a handful of core&middot;hours and therefore do not change the economics; they too can be completed before the first run of the next genome is completed. Thus, with a $500 investment in computational resources, you can more than keep pace with the Illumina instrument. Note that I am completely excluding the cost of storage, as that will be needed for the data and results regardless of where the computation is done. Of course, you probably wouldn&#8217;t buy just <em>one</em> core.
Checking over at the <a href="http://www.dell.com/us/en/highered/df.aspx?refid=df&#038;s=hied&#038;cs=RC956904&#038;~ck=mn">Dell Higher Education web site</a>, you can get a Quad Core Precision T3500n with 4 GiB of RAM (more RAM per core than the <a href="http://aws.amazon.com/ec2/#instance">Amazon EC2 Extra Large Instance</a> used in the paper) and 750 GB local storage capacity (about the same storage per core as the Extra Large Instance) for $1700. You would need less than one core&#8217;s worth (25%) of that workstation&#8217;s capacity dedicated to alignment of and variant detection on data from a single Illumina GA IIx (thanks to <a href="http://en.wikipedia.org/wiki/Burrows-Wheeler_transform">Burrows-Wheeler Transform</a> aligners like bowtie and <a href="http://bio-bwa.sourceforge.net/">bwa</a>). Using the single core numbers, the break even point for purchase versus cloud is less than five whole genomes. Using the entire cost of the Dell workstation (even though you require less than 25% of its computational capacity), the break even point is about 14 genomes. It would take about 1.5 years (about half the expected life of IT hardware) at current throughput to sequence 14 genomes with a single Illumina GA IIx. At data rates expected in January 2010, it would take less than a year to break even.</p>
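<p>The break even numbers above are simple enough to reproduce in a few lines. This is only a sketch using the rough figures from this post; the function and variable names are made up for illustration.</p>

```python
import math

EC2_COST_PER_GENOME = 125   # USD, Crossbow run including transfer and S3 storage
COST_PER_CORE = 500         # USD, fully loaded core in an existing cluster
WORKSTATION_PRICE = 1700    # USD, quad-core Dell workstation
DAYS_PER_GENOME = 40        # four ten-day GA IIx runs for 38x coverage


def break_even_genomes(purchase_cost, rental_cost_per_genome=EC2_COST_PER_GENOME):
    """Number of genomes at which total rental cost equals the purchase cost."""
    return purchase_cost / rental_cost_per_genome


cluster_320 = break_even_genomes(320 * COST_PER_CORE)  # 1280.0
single_core = break_even_genomes(COST_PER_CORE)        # 4.0
workstation = break_even_genomes(WORKSTATION_PRICE)    # 13.6

# Time for one instrument to produce enough genomes to pay off the workstation.
years_to_break_even = math.ceil(workstation) * DAYS_PER_GENOME / 365

# Can a single core keep pace? Worst case is eight lanes at 12 core-hours each,
# and the next run takes ten days to finish.
align_hours_worst_case = 8 * 12
run_hours = 10 * 24
one_core_keeps_pace = run_hours >= align_hours_worst_case

print(f"320 cores: {cluster_320:.0f} genomes")
print(f"one core: {single_core:.0f} genomes")
print(f"workstation: {workstation:.1f} genomes")
print(f"years to break even on the workstation: {years_to_break_even:.1f}")
print(f"one core keeps pace: {one_core_keeps_pace}")
```

<p>Swapping in your own per-core cost or instrument throughput is the quickest way to see where the break even point falls for your lab.</p>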
<p>These numbers indicate that unless you are just sequencing a few genomes, you are probably better off purchasing a (possibly single node) cluster. With the proliferation of sequencing applications and publications in the last couple of years, not many researchers will fall into the &#8220;few genomes&#8221; bin. Our experience has been that the more sequencing data people get, the more they want. Another way to look at this is that the entire analysis computational hardware cost (&lt;$1700) is less than 1% of the sequencing instrument cost; or the computational cost to analyze a whole genome (&lt;$500) is less than 1% of the total data generation costs (reagents, flow cells, instrument depreciation, technician time, etc.). This is all not to say that there is not a place for cloud and other distributed computing frameworks in bioinformatics, but that&#8217;s the topic of a future post.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.politigenomics.com/2009/11/bioinformatics-and-cloud-computing.html/feed</wfw:commentRss>
		<slash:comments>9</slash:comments>
		</item>
		<item>
		<title>Next-Generation Sequencing Informatics table update</title>
		<link>http://www.politigenomics.com/2009/10/next-generation-sequencing-informatics-table-update.html</link>
		<comments>http://www.politigenomics.com/2009/10/next-generation-sequencing-informatics-table-update.html#comments</comments>
		<pubDate>Mon, 05 Oct 2009 14:45:02 +0000</pubDate>
		<dc:creator>dd</dc:creator>
				<category><![CDATA[genomics]]></category>
		<category><![CDATA[IT]]></category>
		<category><![CDATA[454]]></category>
		<category><![CDATA[compute]]></category>
		<category><![CDATA[Illumina]]></category>
		<category><![CDATA[informatics]]></category>
		<category><![CDATA[SOLiD]]></category>
		<category><![CDATA[storage]]></category>

		<guid isPermaLink="false">http://www.politigenomics.com/?p=1606</guid>
		<description><![CDATA[I have made some updates to the Next-Generation Sequencing Informatics table. Specifically, I have updated the numbers for 454 Ti, including paired-end information, and added information on the Illumina GA IIx. If anyone that is not employed by AB has real-world numbers for SOLiD 3, I&#8217;d appreciate you passing them along to me (I&#8217;m looking [...]]]></description>
			<content:encoded><![CDATA[<p>I have made some updates to the <a href="http://www.politigenomics.com/next-generation-sequencing-informatics">Next-Generation Sequencing Informatics table</a>. Specifically, I have updated the numbers for 454 Ti, including paired-end information, and added information on the Illumina GA IIx. If anyone that is not employed by AB has real-world numbers for SOLiD 3, I&#8217;d appreciate you passing them along to me (I&#8217;m looking at you drd).</p>
<p><strong>Update:</strong> I received some SOLiD 3 numbers from Nicholas Socci (thanks Nicholas!).</p>
<p><strong>Update2:</strong> I received a fuller set of numbers from drd and the SOLiD 3 column is complete (thanks drd!).</p>
]]></content:encoded>
			<wfw:commentRss>http://www.politigenomics.com/2009/10/next-generation-sequencing-informatics-table-update.html/feed</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>Expansion</title>
		<link>http://www.politigenomics.com/2009/10/expansion.html</link>
		<comments>http://www.politigenomics.com/2009/10/expansion.html#comments</comments>
		<pubDate>Fri, 02 Oct 2009 19:04:04 +0000</pubDate>
		<dc:creator>dd</dc:creator>
				<category><![CDATA[IT]]></category>
		<category><![CDATA[compute]]></category>
		<category><![CDATA[data center]]></category>
		<category><![CDATA[storage]]></category>
		<category><![CDATA[wustl]]></category>

		<guid isPermaLink="false">http://www.politigenomics.com/?p=114</guid>
		<description><![CDATA[The Genome Data Center has received a Gold LEED Certification from the U.S. Green Building Council. This is in addition to the Keystone Award from the St. Louis Association of General Contractors. It is quite an achievement for a power hungry data center to receive a LEED certification, much more a Gold Certification, but the [...]]]></description>
			<content:encoded><![CDATA[<p><a href="http://www.usgbc.org/DisplayPage.aspx?CMSPageID=1991"><img src="http://www.politigenomics.com/wp-content/uploads/2009/10/leed.png" alt="LEED Certification" title="LEED Certification" width="275" height="393" class="alignright size-full wp-image-1581" /></a></p>
<p>The Genome Data Center has received a <a href="http://www.usgbc.org/DisplayPage.aspx?CategoryID=19">Gold LEED Certification</a> from the <a href="http://www.usgbc.org/">U.S. Green Building Council</a>. This is in addition to the <a href="http://www.politigenomics.com/2008/11/keystone-award-for-data-center.html">Keystone Award</a> from the St. Louis Association of General Contractors. It is quite an achievement for a power-hungry data center to receive a LEED certification, much less a Gold Certification, but the WUSM Design and Construction team along with the architects, engineers, and contractors were able to pull it off.</p>
<p>Recently the final phase of construction at the Genome Data Center was completed. The initial build out had enough power and cooling for about 40 racks of equipment. Now at full capacity, the data center is capable of supplying 4 MW of power (about the amount used by 800 homes on a hot day) and the requisite cooling to the equipment housed within it. This will support over 100 racks worth of high-density computational (blades) and storage equipment and its supporting infrastructure (chilled water plants, air handlers, humidity control, office space, etc.). The electrical system is completely redundant, all the way to the double-ended substation of our electrical utility. That means even if we lose one entire electrical feed, we can still operate on utility power. If we lose both electrical feeds, we have battery and fly-wheel UPS systems to carry us until the two 2 MW diesel generators start (under 10 seconds). <a href="http://www.flickr.com/photos/ddgenome/3730196210/" title="2 MW diesel generator"><img src="http://farm4.static.flickr.com/3428/3730196210_cdb8bfeb18.jpg" width="452" height="339" alt="generator" /></a> The building is about 1480 m<sup>2</sup> while the actual data center is about 288 m<sup>2</sup> (as they shrink computing equipment, the required electrical and cooling equipment keeps increasing in size). The data center is arranged in a standard hot aisle/cold aisle layout with cooling delivered from below through floor grates (perf plates did not provide enough airflow) via a 1.2 m raised floor. <a href="http://www.flickr.com/photos/ddgenome/3730194974/" title="data center cold aisle"><img src="http://farm3.static.flickr.com/2565/3730194974_8fddfc00b1.jpg" width="452" height="339" alt="cold aisle" /></a> We currently have about 3,000 cores in our computational cluster and over 3 PB (3,000,000 GB) of storage online. 
When full of equipment in a few years, the data center will likely house tens of thousands of cores and on the order of 100 PB of storage.</p>
<p>There are more pictures of the Genome Data Center on <a href="http://www.flickr.com/photos/22486047@N03/sets/72157603633991423/">Flickr</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.politigenomics.com/2009/10/expansion.html/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Illumina cluster needs</title>
		<link>http://www.politigenomics.com/2009/06/illumina-cluster-needs.html</link>
		<comments>http://www.politigenomics.com/2009/06/illumina-cluster-needs.html#comments</comments>
		<pubDate>Thu, 18 Jun 2009 16:13:36 +0000</pubDate>
		<dc:creator>dd</dc:creator>
				<category><![CDATA[genomics]]></category>
		<category><![CDATA[IT]]></category>
		<category><![CDATA[compute]]></category>
		<category><![CDATA[FLOSS]]></category>
		<category><![CDATA[Illumina]]></category>
		<category><![CDATA[LSF]]></category>
		<category><![CDATA[software]]></category>

		<guid isPermaLink="false">http://www.politigenomics.com/?p=1254</guid>
		<description><![CDATA[There is an interesting thread over at the Solexa Google Group about the IT infrastructure needed to support an Illumina Genome Analyzer (GA). The discussion focuses mostly on the cluster and, to a lesser extent, the storage and network required to operate the instrument and generate sequence data (primary analysis). At The Genome Center, we [...]]]></description>
			<content:encoded><![CDATA[<p>There is an interesting thread over at the <a href="http://groups.google.com/group/solexa?hl=en">Solexa Google Group</a> about the <a href="http://groups.google.com/group/solexa/browse_thread/thread/38ff88dcf5f5df17?hl=en">IT infrastructure needed to support an Illumina Genome Analyzer (GA)</a>. The discussion focuses mostly on the cluster and, to a lesser extent, the storage and network required to operate the instrument and generate sequence data (primary analysis). At <a href="http://genome.wustl.edu/">The Genome Center</a>, we use Platform LSF HPC as our batch scheduler and currently use <a href="http://www.politigenomics.com/2008/03/illumina-genome-analyzer-pipeline-and.html">lsgmake-gap</a> to execute the GAPipeline (the Illumina primary analysis software) in parallel on our cluster. However, GAPipeline is developed and tested by Illumina on a cluster managed by <a href="http://www.sun.com/software/sge/">Sun Grid Engine (SGE)</a>, which is <a href="http://gridengine.sunsource.net/">free/open source software</a>. This situation creates some headaches for us because as the internals of GAPipeline change, we need to <a href="http://www.politigenomics.com/2009/02/lsgmake-gap-for-gapipeline-13.html">regularly update lsgmake-gap</a> so that GAPipeline will continue to run properly on our cluster. Several years ago when we migrated to LSF, the driving force for the selection of LSF was that it was the only batch scheduler that could handle scheduling 50,000+ jobs at a time (a regular occurrence on our cluster). Fortunately, users may no longer have to choose between scalability and ease of use when running GAPipeline as part of their larger computational needs. Chris Dagdigian, who writes the <a href="http://gridengine.info/">gridengine.info blog</a>, had this information about the current capabilities of SGE.</p>
<blockquote><ol>
<li>SGE 6.2 design goal includes supporting a single array job with 500,000 tasks and hundreds of thousands of concurrent jobs</li>
<li>People have been running hundreds of thousands of SGE jobs per week since the SGE 5.3 days many years ago</li>
<li>I personally know of several sites pushing hundreds of thousands of heavy SGE jobs per week through their systems right now</li>
<li>SGE 6.2 runs a 62,000 core cluster in Texas (RANGER) and has been for some time</li>
</ol>
<p>&#8220;tens of thousands of jobs&#8221; is actually pretty easy with Grid Engine and has been for some time, scaling issues encountered in this range have more to do with bad spooling decisions, filesystem design and occasionally an overwhelmed qmaster host. The developers have worked quite a bit this year to improve threading performance, reduce memory footprints and remove things like external RSH methods that consumed system resources like filehandles and TCP ports etc.</p>
<p>This is especially evident in the SGE 6.2  and 6.2u1 release series where speed and scaling were specifically addressed as part of the design effort (6.2u3 and 6.3 will introduce new features). This is the reason why the <em>SGE scheduler is now a thread within the qmaster</em> &#8211; one of the more obvious user-visible changes made recently. (emphasis mine &#8211; dd)</p>
<p>There are many reasons why one would chose between LSF vs SGE (I have used both for years now) but scaling is not one of the significant selection factors. Features, price, APIs and quality of documentation are far more important along with community adoption/support.</p>
</blockquote>
<p>I would guess breaking out the scheduler into its own thread is a major factor in SGE&#8217;s ability to manage so many jobs. This was the major deficiency of SGE and other batch schedulers we tested at the time. Several systems designed their schedulers to automatically run through the list of jobs at certain intervals. With a lot of jobs in the queue, the scheduler would not finish its previous traversal before the new one was scheduled to start. Depending on the implementation, this meant either that the original scheduling pass was killed and the scheduler never processed some jobs, or that scheduler threads kept spawning until the resources were exhausted on the master node (that&#8217;s bad).</p>
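<p>To make that failure mode concrete, here is a toy model of a timer-driven scheduler. It is not how SGE, LSF, or any real scheduler is implemented; the scan rate and interval are invented numbers, and the model assumes overlapping passes do not slow each other down (in practice they would, which makes things worse).</p>

```python
def timer_scheduler(queue_length, jobs_per_second, interval_s):
    """Toy model of a scheduler that starts a new pass every interval_s seconds.

    Returns a tuple (never_reached, overlapping_passes):
      never_reached      -- jobs at the tail of the queue that are never visited
                            if each pass is killed when the next one starts
      overlapping_passes -- passes alive at once in steady state if old passes
                            are instead left running alongside new ones
    """
    # Jobs a single pass can visit before the next pass is due to start.
    capacity = jobs_per_second * interval_s

    # "Kill" design: everything beyond the capacity is starved.
    never_reached = max(0, queue_length - capacity)

    # "Spawn" design: ceiling division gives how many passes overlap.
    overlapping_passes = max(1, -(-queue_length // capacity))

    return never_reached, overlapping_passes


# Hypothetical numbers: 50,000 queued jobs, 100 jobs scanned per second,
# and a new scheduling pass every 60 seconds.
starved, overlapping = timer_scheduler(50_000, 100, 60)
print(f"jobs never reached (kill design): {starved}")
print(f"overlapping passes (spawn design): {overlapping}")
```

<p>With those made-up numbers, the kill design never reaches 44,000 of the 50,000 jobs, and the spawn design settles at nine overlapping passes even before any contention between them is counted.</p>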
<p>(A couple of asides here: since GAPipeline is built on Makefiles, another option that came up in the thread was parallel execution across an LSF cluster using <a href="http://distmake.sourceforge.net/pmwiki/pmwiki.php">distmake</a>. Also, because of <a href="http://hpcinfo.com/">our interest</a> in <a href="http://www.opensciencegrid.org/">grid computing</a>, we are currently investigating replacing LSF with <a href="http://www.cs.wisc.edu/condor/">Condor</a>.)</p>
<p>Of course, with the roll out of SCS2.4 with RTA (real-time analysis), most of the primary analysis is now done on the instrument control computer. Thus, all of this talk about the requirements to produce sequence from the machine is much less important. There is now only one stage of the pipeline, the alignment and reporting (called GERALD), that is still run off the instrument computer. The most computationally intensive part of this stage of the pipeline is the alignment (ELAND and its post-processing) and it can only be made parallel on a per lane basis, i.e., eight ways.</p>
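<p>The practical consequence of per-lane parallelism is that throwing more than eight cores at the alignment buys you nothing. A small sketch of that ceiling follows; the ten hours per lane is a placeholder I picked for illustration, not a measured ELAND number.</p>

```python
import math

LANES_PER_FLOW_CELL = 8  # the alignment can only be split one process per lane


def alignment_wall_hours(cores, hours_per_lane):
    """Wall-clock hours to align one flow cell when each lane is one serial job."""
    # Cores beyond the number of lanes sit idle.
    usable_cores = min(cores, LANES_PER_FLOW_CELL)
    # Lanes are processed in rounds of usable_cores at a time.
    rounds = math.ceil(LANES_PER_FLOW_CELL / usable_cores)
    return rounds * hours_per_lane


HOURS_PER_LANE = 10  # placeholder value, for illustration only

for cores in (1, 2, 4, 8, 64):
    hours = alignment_wall_hours(cores, HOURS_PER_LANE)
    print(f"{cores:2d} cores: {hours} hours")
```

<p>Going from eight cores to 64 leaves the wall-clock time unchanged, which is the whole point of the &#8220;eight ways&#8221; limit.</p>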
<p>Of course, there is also the specter of the requirements for sequence analysis at Illumina GA IIx scale, but that&#8217;s a whole other post&hellip;</p>
]]></content:encoded>
			<wfw:commentRss>http://www.politigenomics.com/2009/06/illumina-cluster-needs.html/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>

