<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>PolITiGenomics &#187; Illumina</title>
	<atom:link href="http://www.politigenomics.com/tag/illumina/feed" rel="self" type="application/rss+xml" />
	<link>http://www.politigenomics.com</link>
	<description>Politics, Information Technology, and Genomics</description>
	<lastBuildDate>Thu, 21 Apr 2011 17:49:06 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.2.1</generator>
		<item>
		<title>Next-Generation Sequencing Informatics Update</title>
		<link>http://www.politigenomics.com/2010/02/next-generation-sequencing-informatics-update.html</link>
		<comments>http://www.politigenomics.com/2010/02/next-generation-sequencing-informatics-update.html#comments</comments>
		<pubDate>Fri, 19 Feb 2010 21:55:23 +0000</pubDate>
		<dc:creator>dd</dc:creator>
				<category><![CDATA[genomics]]></category>
		<category><![CDATA[IT]]></category>
		<category><![CDATA[compute]]></category>
		<category><![CDATA[Illumina]]></category>
		<category><![CDATA[informatics]]></category>
		<category><![CDATA[storage]]></category>

		<guid isPermaLink="false">http://www.politigenomics.com/?p=2143</guid>
		<description><![CDATA[I updated the Next-Generation Sequencing Informatics table a few weeks ago but forgot to mention it on the blog. The main update was the 50G configuration of the Illumina GA IIx. Also, the Sides &#038; Associates blog linked to my table and referred to it as a &#8220;somewhat dated comparison of next-generation sequencing platforms.&#8221; Just [...]]]></description>
			<content:encoded><![CDATA[<p>I updated the <a href="http://www.politigenomics.com/next-generation-sequencing-informatics">Next-Generation Sequencing Informatics table</a> a few weeks ago but forgot to mention it on the blog. The main update was the 50G configuration of the <a href="http://www.illumina.com/systems/genome_analyzer_iix.ilmn">Illumina GA IIx</a>. Also, the Sides &#038; Associates blog linked to my table and referred to it as a &#8220;<a href="http://sidesandassociates.com/blog/2010/01/01/the-business-of-sequencing/">somewhat dated comparison of next-generation sequencing platforms</a>.&#8221; Just to clarify, this table represents <em>average</em> throughput for <em>production</em> systems; not vendor claims about throughput, not future vaporware (and Alejandro Gutierrez corrected his description in the post once I pointed this out). As new systems come online and further improvements are made to existing platforms, the table will be updated.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.politigenomics.com/2010/02/next-generation-sequencing-informatics-update.html/feed</wfw:commentRss>
		<slash:comments>6</slash:comments>
		</item>
		<item>
		<title>Seq-o-matic &#8217;76</title>
		<link>http://www.politigenomics.com/2010/02/seq-o-matic-76.html</link>
		<comments>http://www.politigenomics.com/2010/02/seq-o-matic-76.html#comments</comments>
		<pubDate>Wed, 03 Feb 2010 22:04:12 +0000</pubDate>
		<dc:creator>dd</dc:creator>
				<category><![CDATA[genomics]]></category>
		<category><![CDATA[Illumina]]></category>

		<guid isPermaLink="false">http://www.politigenomics.com/?p=1944</guid>
		<description><![CDATA[Soon after Illumina announced its HiSeq 2000, it also announced the GA IIx&#8216;s little brother, the GA IIe. The IIe will produce about half as much data as the IIx, but no one seems to know exactly how this is done. The unit is cheaper than the IIx, $250,000 for the IIe compared to $400,000 [...]]]></description>
			<content:encoded><![CDATA[<p><a href="http://www.hulu.com/watch/19046/saturday-night-live-bassomatic"><img src="http://www.politigenomics.com/wp-content/uploads/2010/02/bassomatic.jpg" alt="Bass-o-matic" title="Bass-o-matic" width="250" height="344" class="alignright size-full wp-image-2090" /></a></p>
<p>Soon after Illumina announced its <a href="http://www.politigenomics.com/2010/01/hiseq-2000.html">HiSeq 2000</a>, it also announced the <a href="http://www.illumina.com/systems/genome_analyzer_iix.ilmn">GA IIx</a>&#8216;s little brother, the <a href="http://www.illumina.com/systems/genome_analyzer.ilmn">GA IIe</a>. The IIe will produce about half as much data as the IIx, but no one seems to know exactly how this is done. The unit is cheaper than the IIx, $250,000 for the IIe compared to $400,000 (I think) for the IIx, but is upgradeable to the IIx. So perhaps the optics system is cheaper. But the run time is the same, so it seems like the optics would need to be about the same (the older optics system was slower). The IIe seems to use the <a href="http://www.illumina.com/systems/genome_analyzer.ilmn#workflow_specs">same kits</a> as the <a href="http://www.illumina.com/systems/genome_analyzer.ilmn#workflow_specs">GA IIx</a>. That seems odd to me because the consumables cost is typically the largest part of the per run cost. So while you will save on instrument depreciation costs per run, those savings disappear when considering cost per Gb. Another way to look at it is that <em>if</em> reagent costs are indeed the same, it makes no sense to buy two GA IIe instruments. You would be much better off buying one GA IIx. It is only if your lab has a sequencing workload that cannot utilize a GA IIx full time that a GA IIe makes economic sense.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.politigenomics.com/2010/02/seq-o-matic-76.html/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Life finds a way</title>
		<link>http://www.politigenomics.com/2010/01/life-finds-a-way.html</link>
		<comments>http://www.politigenomics.com/2010/01/life-finds-a-way.html#comments</comments>
		<pubDate>Fri, 29 Jan 2010 22:43:25 +0000</pubDate>
		<dc:creator>dd</dc:creator>
				<category><![CDATA[genomics]]></category>
		<category><![CDATA[Illumina]]></category>
		<category><![CDATA[SOLiD]]></category>

		<guid isPermaLink="false">http://www.politigenomics.com/?p=2060</guid>
		<description><![CDATA[Earlier this week Life Technologies announced the next revision of their SOLiD platform, SOLiD 4. I don&#8217;t have all the details that I had for the Illumina HiSeq 2000, but here is what I do know: the system will produce 100 Gb of alignable sequence data on two slides per 14-day run. The sequence [...]]]></description>
			<content:encoded><![CDATA[<p><a href="http://www.appliedbiosystems.com/solid4"><img alt="SOLiD 4" src="http://www3.appliedbiosystems.com/cms/groups/portal/documents/web_content/cms_076478.jpg" title="SOLiD 4" class="alignright" width="200" height="205" /></a></p>
<p>Earlier this week <a href="http://www.lifetechnologies.com/">Life Technologies</a> announced the next revision of their SOLiD platform, <a href="http://www.lifetechnologies.com/life-technologies-brings-genomic-sequencing-closer-clinic.html">SOLiD 4</a>. I don&#8217;t have all the details that I had for the <a href="http://www.politigenomics.com/2010/01/hiseq-2000.html">Illumina HiSeq 2000</a>, but here is what I do know: the system will produce 100 Gb of alignable sequence data on two slides per 14-day run. The sequence data will be paired-end, 50&times;35 base reads. Reagent costs for each run will be about $6,000. Since you need about 100 Gb of sequence to sequence a human genome, you&#8217;re looking at about $6,000 in reagent costs per human genome. They also indicated that capacity for the instrument will increase to 300 Gb per run and the cost for reagents per human genome will be less than $3,000 by the end of 2010. In comparison, the Illumina HiSeq 2000 reagent costs will be about $10,000 per human genome at its release with, by <em>my</em> calculations, a path to about $4,000 per human genome (I have no idea what the time frame might be to reach the end of that path, but given this announcement by Life, it will likely be aggressive). You have to love the way competition drives down costs. Just as Illumina revealed a big HiSeq 2000 purchase alongside its announcement, Life announced that <a href="http://www.lifetechnologies.com/life-technologies-and-ignite-institute-partner-create-largest-next-generation-genomic-sequencing-fac">Ignite Institute would acquire 100 SOLiD 4 instruments</a> as part of a partnership with Life. Life also announced a major bioinformatics investment program as well as a physician education program through their Foundation.</p>
<p><strong>Update:</strong> According to the press release, Ignite is &#8220;acquiring&#8221;, not purchasing, the instruments in &#8220;partnership&#8221; with Life. So it appears this is not an outright purchase of a large number of instruments. I have updated the text in the post to be more accurate.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.politigenomics.com/2010/01/life-finds-a-way.html/feed</wfw:commentRss>
		<slash:comments>7</slash:comments>
		</item>
		<item>
		<title>HiSeq 2000</title>
		<link>http://www.politigenomics.com/2010/01/hiseq-2000.html</link>
		<comments>http://www.politigenomics.com/2010/01/hiseq-2000.html#comments</comments>
		<pubDate>Wed, 13 Jan 2010 00:48:53 +0000</pubDate>
		<dc:creator>dd</dc:creator>
				<category><![CDATA[genomics]]></category>
		<category><![CDATA[IT]]></category>
		<category><![CDATA[cloud]]></category>
		<category><![CDATA[compute]]></category>
		<category><![CDATA[Illumina]]></category>
		<category><![CDATA[informatics]]></category>

		<guid isPermaLink="false">http://www.politigenomics.com/?p=1914</guid>
		<description><![CDATA[Today Illumina announced their new, high-throughput sequencing instrument, the HiSeq 2000. Sure, the name isn&#8217;t that great, but the capabilities, if not game changing, are a significant step forward: the HiSeq 2000 can sequence a tumor/normal pair to 30&#215; coverage in one 8-day run. How does it achieve this five-fold improvement in throughput over current [...]]]></description>
			<content:encoded><![CDATA[<p><a href="http://www.illumina.com/systems/hiseq_2000.ilmn"><img alt="" src="http://www.illumina.com/images/systems/hiseq_2000.jpg" title="HiSeq 2000" class="alignright" width="265" height="290" /></a></p>
<p>Today Illumina announced their new, high-throughput sequencing instrument, the <a href="http://www.illumina.com/systems/hiseq_2000.ilmn">HiSeq 2000</a>. Sure, the name isn&#8217;t that great, but the capabilities, if not game changing, are a significant step forward: the HiSeq 2000 can sequence a tumor/normal pair to 30&times; coverage in one 8-day run. How does it achieve this five-fold improvement in throughput over current second-generation sequencing technologies? What it doesn&#8217;t do is change the fundamentals of the Illumina sequencing technology. The HiSeq 2000 uses <a href="http://www.illumina.com/technology/sequencing_technology.ilmn">Sequencing By Synthesis (SBS)</a>, just like the Genome Analyzer (GA). In fact, it actually dials down the current SBS state of the art, using lower cluster densities (350,000 &#8211; 400,000 clusters/mm<sup>2</sup>) and read lengths (2&times;100) than the latest GA IIx release (600,000 clusters/mm<sup>2</sup> and 2&times;125). (Current tiles are 0.5293 mm<sup>2</sup>, so 600,000 clusters/mm<sup>2</sup> equate to about 318,000 clusters/tile.) The throughput improvement comes from two major factors: increased data collection <em>area</em> and <em>rate</em>. The HiSeq 2000 has two 8-lane flow cells, as compared to the single flow cell on the GA, and images both the top and bottom surfaces of the flow cell. In addition, the imaging area of the HiSeq 2000 flow cell is larger than the GA flow cell&#8217;s. This all adds up to a more than five-fold increase in surface area to collect data from on the HiSeq 2000. As you know if you operate a GA, the imaging part of each cycle takes up more time than the chemistry portion. 
Thus, to run two flow cells on the same instrument, Illumina needed to speed up data acquisition so that it was at least as fast as the chemistry stage so that one flow cell could be doing chemistry while the other was imaging (like the <a href="http://www3.appliedbiosystems.com/AB_Home/applicationstechnologies/SOLiD-System-Sequencing-C/index.htm">SOLiD</a> platform from Life Technologies). To do this, they used their experience with systems like iScan and its <a href="http://en.wikipedia.org/wiki/Time_Delay_and_Integration">Time Delay and Integration (TDI)</a> line imaging technology, and completely replaced the entire optics system. The GA performs area imaging to collect its image data. The flow cell is moved, the camera focuses, and four images (tiles) are taken (one for each base). The flow cell is then moved again and the process repeated. For the current GA IIx, each of the eight lanes is imaged at 120 positions (in a 2&times;60 grid) resulting in 480 images per lane per cycle. The HiSeq 2000 scans a 2048 pixel wide swath down one side of a lane and then comes back and scans the swath on the other side of the lane. This is then repeated for the other surface in the lane and then across all the lanes. Because of this continuous data collection, there are four cameras in the system rather than one. This line scanning system is able to collect data at a rate of 50 MB/s, as compared to about 8 MB/s in the GA IIx. When you put all of this together, the HiSeq 2000 is able to generate about 200 Gb of sequence from over 1 billion clusters in the form of 2&times;100 base reads from two flow cells in about eight days with error rates (1-2%) comparable to current GA IIx data (as one would expect since both use SBS). Illumina actually already has data from &#8220;production&#8221; instruments on several human genomes.</p>
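<p>As a quick check of the arithmetic in the paragraph above, here is a minimal Python sketch that recomputes clusters per tile, images per lane per cycle, and average daily output from the quoted figures. The variable names are illustrative only; nothing here comes from Illumina software.</p>

```python
# Figures quoted in the post; the names are illustrative only.
TILE_AREA_MM2 = 0.5293       # area of a current GA IIx tile
GA_DENSITY = 600_000         # clusters per mm^2 on the latest GA IIx release
POSITIONS_PER_LANE = 120     # 2 x 60 grid of imaging positions
IMAGES_PER_POSITION = 4      # one image per base

clusters_per_tile = TILE_AREA_MM2 * GA_DENSITY
images_per_lane_per_cycle = POSITIONS_PER_LANE * IMAGES_PER_POSITION

HISEQ_GB_PER_RUN = 200       # about 200 Gb from two flow cells
HISEQ_RUN_DAYS = 8           # in about eight days
hiseq_gb_per_day = HISEQ_GB_PER_RUN / HISEQ_RUN_DAYS

print(round(clusters_per_tile))    # 317580, i.e. about 318,000 clusters/tile
print(images_per_lane_per_cycle)   # 480
print(hiseq_gb_per_day)            # 25.0
```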
<p>Because of the five-fold increase in sequence data generation rate (25 Gb/day versus 5 Gb/day for the GA IIx), Illumina needed to rethink how it processed and stored all the data. Normal hard drives cannot write four 625 MB images every 30 seconds. As such, images are not written to disk by default; they are processed in memory by the instrument control software (as opposed to the GA where images are written to disk and processed by RTA which also does the base calling). You can save images if you want, but you will need 32 TB of disk space per run and it will slow down your run. As with the most recent version of RTA for the GA IIx, you can save thumbnail images (without penalty) to aid in troubleshooting (the thumbnails, of course, cannot be used for off-instrument analysis). Because of the need to incorporate phasing and pre-phasing information when base calling, the RTA for HiSeq lags a few cycles behind the current data acquisition cycle. The result is that base calling does not actually complete until about two hours after the run completes. In other words, the processing of data is not real time, but it is synchronous. In fact, if the data analysis falls behind, the instrument is paused in a safe state until it catches up. This is guaranteed to occur at least once in each run: after around five cycles the instrument will pause for about two hours while template generation (cluster identification) is performed. The large data rates also forced Illumina to rethink how they store and transfer data off the instrument. Gone are the QSEQ files; they are replaced by BCL files which are binary, per image, per cycle files that contain the base call and quality information. Because they are per image, per cycle files, they can be transferred cycle by cycle as they are generated (as opposed to QSEQ files which are read based). The BCL files are also more compact, requiring only 1 byte/base (B/b) as compared to QSEQ files which require about 2.5 B/b. 
In addition, the intensity files are also not transferred by default, so RTA output goes from 10 B/b to just 1 B/b. Thus, even though you are generating five times more sequence data than a GA, your RTA directory will actually be smaller (about 250 GB).</p>
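<p>To make the bytes-per-base comparison concrete, the rough Python sketch below multiplies the 200 Gb run yield by each of the B/b figures quoted above, and also works out the write rate implied by four 625 MB images every 30 seconds. The function and variable names are my own, and the estimate counts base-call bytes only, so the 200 GB it gives for BCL is a lower bound on the roughly 250 GB RTA directory mentioned above.</p>

```python
# All inputs are figures quoted in the post; names are illustrative.
RUN_OUTPUT_GBASES = 200              # sequence per HiSeq 2000 run, in Gb

BYTES_PER_BASE = {
    "BCL": 1.0,                      # binary, per image, per cycle files
    "QSEQ": 2.5,                     # read-based files
    "full GA-style RTA output": 10.0,  # includes the intensity files
}

def output_size_gb(gbases, bytes_per_base):
    """Disk space in GB for a given yield (in Gb) and bytes per base."""
    return gbases * bytes_per_base

for name, bpb in BYTES_PER_BASE.items():
    print(f"{name}: {output_size_gb(RUN_OUTPUT_GBASES, bpb):.0f} GB")

# Sustained write rate needed to save four 625 MB images every 30 seconds.
image_write_mb_per_s = 4 * 625 / 30
print(f"image writes: {image_write_mb_per_s:.1f} MB/s")
```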
<p>The HiSeq 2000 has a completely new instrument software user interface. The instrument user interface allows the operator to input data via a keyboard and mouse or a touch screen. Run configuration and setup are done via a wizard-driven workflow. The setup and running of each flow cell is completely independent. This allows you to start the runs at different times, have a different number of cycles for each flow cell, and even do an indexing run on one flow cell and a standard paired-end run on the other. The cycles of each flow cell will need to synchronize so that one is doing chemistry and the other data acquisition. Unfortunately, the current version of the instrument control software has no LIMS integration capabilities. Since this instrument is clearly targeting large genome centers, that is unfortunate.</p>
<p>The instrument software also has greatly enhanced real-time metric reporting as compared to the GA. In addition to the RTA reports, e.g., cluster density, intensity, focus, and quality scores, the standard reports typically generated after a GA run by GERALD, e.g., the Summary report, are generated cycle by cycle by RTA and made available to the operator via the instrument control software and remotely as HTML pages (there is also discussion of a smart phone application). <a href="http://en.wikipedia.org/wiki/Phi_X_174">Phi X</a> can be spiked into lanes to allow the software to generate error rate numbers (and Error and Perfect plots) on the fly as well. All in all, the reports are very similar to those people have become familiar with using the GA; they are just generated dynamically during the run. This will allow operators to more carefully observe their runs and take corrective action if something goes awry. All of the extra data processing and reports do not come without the requirement of additional computational horsepower. Don&#8217;t worry though, no iPAR is necessary. The HiSeq instrument computer is just beefier than its GA counterpart: two quad-core 64-bit processors, 48 GiB of RAM, and a 64-bit Microsoft Windows Vista operating system. For downstream analysis, Illumina will still offer their IlluminaCompute (turn-key sequence data analysis cluster) but also is strongly pushing cloud-based analysis solutions (specifically Amazon AWS). Illumina has altered GERALD so ELANDv2 can run using more than one process per lane. Alignment of 200 Gb of data using ELANDv2 takes about 30 hours using 64 cores.</p>
<p>The good and the bad of this instrument is that it is really just more of the same.  Illumina has taken the optics from iScan and combined that with the fluidics and chemistry of the GA. This means the system is more likely to &#8220;work&#8221; at launch than those of us dealing with new sequencing platforms are used to. It also means the data will be familiar (just more of it) and therefore will suffer from the same limitations (increasing errors with read length, short insert sizes). Shrinking from the bleeding edge of the GA in terms of cluster density and read length means the HiSeq likely has significant head room to increase well beyond 200 Gb/run. A quick back of the envelope calculation pushing the HiSeq to 600,000 clusters/mm<sup>2</sup> and 2&times;150 read lengths results in 450 Gb/run. (<em>Again, that is my rough calculation and not any sort of promise from Illumina.</em>) So, while it may be more of the same, it is likely that it will be a <strong>lot</strong> more of the same. The ability to sequence a tumor and normal genome from an individual in a single instrument run in about a week is really going to change the calculation (and economics) for cancer sequencing going forward.</p>
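<p>The back of the envelope calculation above can be sketched in a few lines of Python. It assumes, as a simplification of my own, that output scales linearly with both cluster density and read length. The 450 Gb/run figure falls out when starting from the upper end (400,000 clusters/mm<sup>2</sup>) of the current density range; starting from the lower end gives a larger number.</p>

```python
def scaled_output_gb(base_gb, base_density, base_read_len,
                     new_density, new_read_len):
    """Scale run output linearly with cluster density and read length.

    Linear scaling is an assumption for illustration, not a vendor claim.
    """
    density_factor = new_density / base_density
    length_factor = new_read_len / base_read_len
    return base_gb * density_factor * length_factor

# Starting point: 200 Gb/run at 2x100 reads and 350,000-400,000 clusters/mm^2.
from_upper = scaled_output_gb(200, 400_000, 100, 600_000, 150)
from_lower = scaled_output_gb(200, 350_000, 100, 600_000, 150)

print(from_upper)          # 450.0
print(round(from_lower))   # 514
```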
<p><strong>Update:</strong> The above text has been corrected to state that QSEQ files are about 2.5 B/b. It is the entire RTA output that is 10 B/b.</p>
<p><strong>Update2:</strong> I&#8217;ve added some links.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.politigenomics.com/2010/01/hiseq-2000.html/feed</wfw:commentRss>
		<slash:comments>9</slash:comments>
		</item>
		<item>
		<title>Bioinformatics and cloud computing</title>
		<link>http://www.politigenomics.com/2009/11/bioinformatics-and-cloud-computing.html</link>
		<comments>http://www.politigenomics.com/2009/11/bioinformatics-and-cloud-computing.html#comments</comments>
		<pubDate>Tue, 24 Nov 2009 19:54:22 +0000</pubDate>
		<dc:creator>dd</dc:creator>
				<category><![CDATA[genomics]]></category>
		<category><![CDATA[IT]]></category>
		<category><![CDATA[cloud]]></category>
		<category><![CDATA[compute]]></category>
		<category><![CDATA[FLOSS]]></category>
		<category><![CDATA[Illumina]]></category>
		<category><![CDATA[informatics]]></category>
		<category><![CDATA[software]]></category>

		<guid isPermaLink="false">http://www.politigenomics.com/?p=1728</guid>
		<description><![CDATA[From the Using clouds for parallel computations in systems biology workshop at the recent SC09 conference (Informatics Iron writeup) to last month&#8217;s Genome Informatics meeting, everyone in bioinformatics is talking about cloud computing these days. Last week Steven Salzberg&#8216;s group published a paper on their Crossbow tool entitled Searching for SNPs with cloud computing (Cloudera [...]]]></description>
			<content:encoded><![CDATA[<p>From the <a href="http://www.mcs.anl.gov/events/workshops/sc09-sysbio/index.php">Using clouds for parallel computations in systems biology</a> workshop at the recent <a href="http://sc09.supercomputing.org/">SC09 conference</a> (<a href="http://www.genomeweb.com/blog/cloud-bio-computing-sc09">Informatics Iron writeup</a>) to last month&#8217;s <a href="http://www.genomeweb.com/informatics/genome-informatics-speakers-say-second-gen-sequencing-makes-giddy-times-bioinfor">Genome Informatics meeting</a>, everyone in bioinformatics is talking about cloud computing these days. Last week <a href="http://genome.fieldofscience.com/">Steven Salzberg</a>&#8216;s <a href="http://www.cbcb.umd.edu/~salzberg/">group</a> published a paper on their Crossbow tool entitled <a href="http://genomebiology.com/2009/10/11/R134">Searching for SNPs with cloud computing</a> (<a href="http://www.cloudera.com/blog/2009/10/15/analyzing-human-genomes-with-hadoop/">Cloudera blog post on Crossbow</a>). In the paper the authors describe how they were able to analyze the human sequence data <a href="http://www.nature.com/nature/journal/v456/n7218/abs/nature07484.html">published last year by BGI</a> using <a href="http://aws.amazon.com/ec2/">Amazon EC2</a>.  Specifically, they have developed an alignment (<a href="http://bowtie-bio.sourceforge.net/index.shtml">bowtie</a>) and SNP detection (<a href="http://soap.genomics.org.cn/soapsnp.html">SoapSNP</a>) pipeline that is executed in parallel across a cluster using the <a href="http://hadoop.apache.org/">Hadoop</a> framework (a <a href="http://fsf.org/">free software</a> implementation of <a href="http://labs.google.com/papers/mapreduce.html">Google&#8217;s MapReduce</a> framework).  Using a 40-node, 320-core EC2 cluster, they were able to analyze 38&times; coverage sequence data in about three hours. 
The whole analysis, including data transfer and storage on <a href="http://aws.amazon.com/s3/">Amazon S3</a>, cost about $125. You can find a more detailed cost breakdown and comparison on Gary Stiehr&#8217;s <a href="http://hpcinfo.com/2009/11/22/benchmarking-the-cloud-for-genomics/">HPCInfo post</a> and more detail on the SNP detection on Dan Koboldt&#8217;s <a href="http://www.massgenomics.org/2009/11/crossbow-ngs-informatics-in-the-cloud.html">Mass Genomics post</a>.</p>
<p>For analyzing a single genome, you really can&#8217;t beat that price. Of course, at the rate next-generation sequencing instruments are generating data, most people are not going to want to analyze just one genome. So the question becomes, what is the break even point? That is, how many genomes do you have to sequence to make buying compute resources cheaper than renting them from Amazon? We currently estimate that the fully loaded (node, chassis, rack, networking, etc.) cost of a single computational core is about $500. Thus, to purchase 320 cores would cost you about $160,000. It&#8217;s going to take a lot of genomes (1280) to hit that break even point. But, do you really need to analyze a genome in three hours? With the current per run throughput of a single Illumina GA IIx, it would take about four ten-day runs (40 days) to generate 38&times; coverage of a human genome. After each run, you could align the sequence data from that run. Each lane of data would take 8-12 core&middot;hours to align, so a whole run&#8217;s (eight lanes&#8217;) worth of data would take about 80 core&middot;hours. Therefore, even if you had just one core, you could align all the data before the next run completed. The consensus calling and variant detection portions of the pipeline typically take a handful of core&middot;hours and therefore do not change the economics; they too can be completed before the first run of the next genome is completed. Thus, with a $500 investment in computational resources, you can more than keep pace with the Illumina instrument. Note that I am completely excluding the cost of storage, as that will be needed for the data and results regardless of where the computation is done. Of course, you probably wouldn&#8217;t buy just <em>one</em> core. 
Checking over at the <a href="http://www.dell.com/us/en/highered/df.aspx?refid=df&#038;s=hied&#038;cs=RC956904&#038;~ck=mn">Dell Higher Education web site</a>, you can get a Quad Core Precision T3500n with 4 GiB of RAM (more RAM per core than the <a href="http://aws.amazon.com/ec2/#instance">Amazon EC2 Extra Large Instance</a> used in the paper) and 750 GB local storage capacity (about the same storage per core as the Extra Large Instance) for $1700. You would need less than one core&#8217;s worth (25%) of that workstation&#8217;s capacity dedicated to alignment of and variant detection on data from a single Illumina GA IIx (thanks to <a href="http://en.wikipedia.org/wiki/Burrows-Wheeler_transform">Burrows-Wheeler Transform</a> aligners like bowtie and <a href="http://bio-bwa.sourceforge.net/">bwa</a>). Using the single core numbers, the break even point for purchase versus cloud is less than five whole genomes. Using the entire cost of the Dell workstation (even though you require less than 25% of its computational capacity), the break even point is about 14 genomes. It would take about 1.5 years (about half the expected life of IT hardware) at current throughput to sequence 14 genomes with a single Illumina GA IIx. At data rates expected in January 2010, it would take less than a year to break even.</p>
<p>These numbers indicate that unless you are just sequencing a few genomes, you are probably better off purchasing a (possibly single node) cluster. With the proliferation of sequencing applications and publications in the last couple of years, not many researchers will fall into the &#8220;few genomes&#8221; bin. Our experience has been that the more sequencing data people get, the more they want. Another way to look at this is that the entire analysis computational hardware cost (&lt;$1700) is less than 1% of the sequencing instrument cost; or the computational cost to analyze a whole genome (&lt;$500) is less than 1% of the total data generation costs (reagents, flow cells, instrument depreciation, technician time, etc.). This is all not to say that there is not a place for cloud and other distributed computing frameworks in bioinformatics, but that&#8217;s the topic of a future post.</p>
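<p>For anyone who wants to plug in their own prices, the break even arithmetic from this post can be sketched in Python as follows. The inputs are the figures quoted above ($125 per genome in the cloud, $500 per core, $1700 for the workstation, 40 days of sequencing per genome); the function and variable names are illustrative, and storage is excluded just as it is in the text.</p>

```python
import math

CLOUD_COST_PER_GENOME = 125   # USD, Crossbow analysis on EC2 plus S3
DAYS_PER_GENOME = 40          # four ten-day GA IIx runs for 38x coverage

def break_even_genomes(hardware_cost, cloud_cost=CLOUD_COST_PER_GENOME):
    """Number of genomes at which buying hardware matches renting."""
    return hardware_cost / cloud_cost

cluster = break_even_genomes(320 * 500)      # 320 cores at 500 USD per core
single_core = break_even_genomes(500)
workstation = math.ceil(break_even_genomes(1700))

# Time for one GA IIx to sequence enough genomes to pay off the workstation.
years_to_break_even = workstation * DAYS_PER_GENOME / 365

print(cluster)                        # 1280.0
print(single_core)                    # 4.0
print(workstation)                    # 14
print(round(years_to_break_even, 2))  # 1.53
```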
]]></content:encoded>
			<wfw:commentRss>http://www.politigenomics.com/2009/11/bioinformatics-and-cloud-computing.html/feed</wfw:commentRss>
		<slash:comments>9</slash:comments>
		</item>
		<item>
		<title>What&#8217;s in an Illumina GA run directory?</title>
		<link>http://www.politigenomics.com/2009/10/whats-in-an-illumina-ga-run-directory.html</link>
		<comments>http://www.politigenomics.com/2009/10/whats-in-an-illumina-ga-run-directory.html#comments</comments>
		<pubDate>Wed, 28 Oct 2009 21:46:40 +0000</pubDate>
		<dc:creator>dd</dc:creator>
				<category><![CDATA[genomics]]></category>
		<category><![CDATA[IT]]></category>
		<category><![CDATA[Illumina]]></category>
		<category><![CDATA[informatics]]></category>
		<category><![CDATA[storage]]></category>

		<guid isPermaLink="false">http://www.politigenomics.com/?p=1660</guid>
		<description><![CDATA[One of the main things that differentiates genomics from other endeavors that use a lot of disk space is that genomics file systems tend to have a lot of files (millions). This was true with Sanger sequencing, and it seems to be even more true with next-generation sequencing technologies, especially Illumina/Solexa and AB SOLiD. This [...]]]></description>
			<content:encoded><![CDATA[<p>One of the main things that differentiates genomics from other endeavors that use a lot of disk space is that genomics file systems tend to have a <em>lot</em> of files (millions). This was true with Sanger sequencing, and it seems to be even more true with next-generation sequencing technologies, especially Illumina/Solexa and AB SOLiD. This large number of files and the parallel access of these files by large computational clusters tend to give most storage solutions great difficulty.</p>
<p>So what, exactly, is in an Illumina run directory? Well, to get breakdowns of file statistics there is a nifty little tool called <a href="http://www.pdsi-scidac.org/fsstats/">fsstats</a>. It is just a simple Perl script that crawls through a directory stat&#8217;ing files and reporting metrics. For example, when you run it on an Illumina GA IIx 2&times;100, high cluster density run after the primary analysis has completed, you get the following information about the distribution of file sizes. (I have rearranged and condensed the information to make it fit.)</p>
<pre style="font-size: x-small; line-height: normal;">
total 7.46 TB used to store 7.46 TB user data, overhead 0.04%
  count=991227 avg=8076.50 KB
  min=0.00 KB max=13128679.30 KB
           size range    count   %tot  %tot cum       total size   %tot  %tot cum
[       0-       2 KB):   4019 ( 0.41) (  0.41)       3009.03 KB ( 0.00) (  0.00)
[       2-       4 KB):      2 ( 0.00) (  0.41)          6.99 KB ( 0.00) (  0.00)
[       4-       8 KB):    981 ( 0.10) (  0.50)       5964.82 KB ( 0.00) (  0.00)
[       8-      16 KB): 193351 (19.51) ( 20.01)    2588619.88 KB ( 0.03) (  0.03)
[      16-      32 KB):   2656 ( 0.27) ( 20.28)      58586.79 KB ( 0.00) (  0.03)
[      32-      64 KB):    901 ( 0.09) ( 20.37)      31369.79 KB ( 0.00) (  0.03)
[      64-     128 KB):   2893 ( 0.29) ( 20.66)     303872.38 KB ( 0.00) (  0.04)
[     128-     256 KB):      2 ( 0.00) ( 20.66)        345.34 KB ( 0.00) (  0.04)
[     256-     512 KB):      4 ( 0.00) ( 20.66)       1222.53 KB ( 0.00) (  0.04)
[     512-    1024 KB):      1 ( 0.00) ( 20.66)        622.26 KB ( 0.00) (  0.04)
[    1024-    2048 KB):      2 ( 0.00) ( 20.66)       3199.89 KB ( 0.00) (  0.04)
[    2048-    4096 KB):     12 ( 0.00) ( 20.66)      41779.69 KB ( 0.00) (  0.04)
[    4096-    8192 KB): 776654 (78.35) ( 99.02) 5863161178.18 KB (73.24) ( 73.28)
[   16384-   32768 KB):     21 ( 0.00) ( 99.02)     487156.46 KB ( 0.01) ( 73.28)
[   32768-   65536 KB):   3856 ( 0.39) ( 99.41)  163552521.17 KB ( 2.04) ( 75.32)
[   65536-  131072 KB):   3825 ( 0.39) ( 99.79)  307535341.32 KB ( 3.84) ( 79.17)
[  131072-  262144 KB):    133 ( 0.01) ( 99.81)   32458046.12 KB ( 0.41) ( 79.57)
[  262144-  524288 KB):   1787 ( 0.18) ( 99.99)  658830514.46 KB ( 8.23) ( 87.80)
[ 2097152- 4194304 KB):     16 ( 0.00) ( 99.99)   47898262.36 KB ( 0.60) ( 88.40)
[ 4194304- 8388608 KB):     64 ( 0.01) (100.00)  432084134.39 KB ( 5.40) ( 93.80)
[ 8388608-16777216 KB):     47 ( 0.00) (100.00)  496603147.67 KB ( 6.20) (100.00)
</pre>
<p>So the total size of the run directory is nearly 7.5 TB and there are almost one million files. The average size of a file in the run directory is about 8 MB and the maximum size is about 12.5 GB. The images (represented in the 4096-8192 KB range) comprise over 78% of the files and 73% of the total size of the run directory. This significant storage penalty can be avoided by using RTA (Real Time Analysis) and not transferring the image files. The largest files are the alignment (ELAND) outputs and the FASTQ files in the GERALD directory. Speaking of directories, here is a breakdown by number of files in each directory.</p>
<pre style="font-size: x-small; line-height: normal;">
  count=1652 avg=601.02 ents
  min=0.00 ents max=24720.00 ents
              range   count   %tot  %tot cum total ent   %tot  %tot cum
  [    0-    1 ents]:     4 ( 0.24) (  0.24)      0.00 ( 0.00) (  0.00)
  [    2-    3 ents]:     1 ( 0.06) (  0.30)      2.00 ( 0.00) (  0.00)
  [    8-   15 ents]:     3 ( 0.18) (  0.48)     26.00 ( 0.00) (  0.00)
  [   16-   31 ents]:     2 ( 0.12) (  0.61)     44.00 ( 0.00) (  0.01)
  [  128-  255 ents]:     9 ( 0.54) (  1.15)   1826.00 ( 0.18) (  0.19)
  [  256-  511 ents]:  1616 (97.82) ( 98.97) 775680.00 (78.12) ( 78.32)
  [  512- 1023 ents]:     3 ( 0.18) ( 99.15)   2920.00 ( 0.29) ( 78.61)
  [ 1024- 2047 ents]:     4 ( 0.24) ( 99.39)   7845.00 ( 0.79) ( 79.40)
  [ 2048- 4095 ents]:     2 ( 0.12) ( 99.52)   6775.00 ( 0.68) ( 80.08)
  [16384-32767 ents]:     8 ( 0.48) (100.00) 197760.00 (19.92) (100.00)
</pre>
<p>The picture for directory entries is a bit muddled, since most of the directories are organized around a small multiple of the number of tiles per lane and fall in the 256-511 entries range. The directories in the 16384-32767 entries range? Those are the image analysis (Firecrest) Temp/L00[1-8] directories, each with 24,720 entries: four <code>clu.txt</code> files per tile (one per color), plus one <code>qcm.xml</code> (XML!) file per cycle for each tile in the lane.</p>
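<p>For anyone curious how a file-size breakdown like the one above comes together, here is a minimal sketch in Python. It is not fsstats itself (that is a Perl script), just an illustration of the same idea: walk a directory tree, stat each file, and bin the sizes into power-of-two KB ranges. The function names <code>bucket</code>, <code>size_histogram</code>, and <code>report</code> are made up for this example, and the output format only loosely imitates the table above.</p>

```python
import os
from collections import Counter


def bucket(size_bytes):
    """Return the [low, high) power-of-two KB range that holds size_bytes."""
    kb = size_bytes / 1024
    low, high = 0, 2
    # Double the upper bound until the size fits: [0,2), [2,4), [4,8), ...
    while kb >= high:
        low, high = high, high * 2
    return (low, high)


def size_histogram(root):
    """Walk root and tally file counts and total bytes per size bucket."""
    counts = Counter()
    totals = Counter()
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                size = os.lstat(path).st_size
            except OSError:
                continue  # file vanished or is unreadable; skip it
            key = bucket(size)
            counts[key] += 1
            totals[key] += size
    return counts, totals


def report(counts, totals):
    """Print one line per bucket: KB range, file count, and total KB."""
    for key in sorted(counts):
        low, high = key
        total_kb = totals[key] / 1024
        print(f"[{low:>8}-{high:>8} KB): {counts[key]:>7} {total_kb:>16.2f} KB")
```

<p>Calling <code>report(*size_histogram(path))</code> on a run directory prints one line per occupied size range. Unlike fsstats, this sketch does not compute percentages, cumulative totals, or the directory-entry breakdown.</p>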
]]></content:encoded>
			<wfw:commentRss>http://www.politigenomics.com/2009/10/whats-in-an-illumina-ga-run-directory.html/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Next-Generation Sequencing Informatics table update</title>
		<link>http://www.politigenomics.com/2009/10/next-generation-sequencing-informatics-table-update.html</link>
		<comments>http://www.politigenomics.com/2009/10/next-generation-sequencing-informatics-table-update.html#comments</comments>
		<pubDate>Mon, 05 Oct 2009 14:45:02 +0000</pubDate>
		<dc:creator>dd</dc:creator>
				<category><![CDATA[genomics]]></category>
		<category><![CDATA[IT]]></category>
		<category><![CDATA[454]]></category>
		<category><![CDATA[compute]]></category>
		<category><![CDATA[Illumina]]></category>
		<category><![CDATA[informatics]]></category>
		<category><![CDATA[SOLiD]]></category>
		<category><![CDATA[storage]]></category>

		<guid isPermaLink="false">http://www.politigenomics.com/?p=1606</guid>
		<description><![CDATA[I have made some updates to the Next-Generation Sequencing Informatics table. Specifically, I have updated the numbers for 454 Ti, including paired-end information, and added information on the Illumina GA IIx. If anyone that is not employed by AB has real-world numbers for SOLiD 3, I&#8217;d appreciate you passing them along to me (I&#8217;m looking [...]]]></description>
			<content:encoded><![CDATA[<p>I have made some updates to the <a href="http://www.politigenomics.com/next-generation-sequencing-informatics">Next-Generation Sequencing Informatics table</a>. Specifically, I have updated the numbers for 454 Ti, including paired-end information, and added information on the Illumina GA IIx. If anyone who is not employed by AB has real-world numbers for SOLiD 3, I&#8217;d appreciate you passing them along to me (I&#8217;m looking at you, drd).</p>
<p><strong>Update:</strong> I received some SOLiD 3 numbers from Nicholas Socci (thanks Nicholas!).</p>
<p><strong>Update2:</strong> I received a fuller set of numbers from drd and the SOLiD 3 column is complete (thanks drd!).</p>
]]></content:encoded>
			<wfw:commentRss>http://www.politigenomics.com/2009/10/next-generation-sequencing-informatics-table-update.html/feed</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>My secret past</title>
		<link>http://www.politigenomics.com/2009/09/my-secret-past.html</link>
		<comments>http://www.politigenomics.com/2009/09/my-secret-past.html#comments</comments>
		<pubDate>Wed, 16 Sep 2009 15:53:01 +0000</pubDate>
		<dc:creator>dd</dc:creator>
				<category><![CDATA[genomics]]></category>
		<category><![CDATA[IT]]></category>
		<category><![CDATA[Illumina]]></category>
		<category><![CDATA[informatics]]></category>
		<category><![CDATA[SOLiD]]></category>
		<category><![CDATA[wustl]]></category>

		<guid isPermaLink="false">http://www.politigenomics.com/?p=1537</guid>
		<description><![CDATA[Now everyone will know about my secret past before I joined The Genome Center: David Dooling: Gangbusters at the Genome Center. Bio-IT World also has a nice interview with Clive Brown of Oxford Nanopore, whom I first described as the most honest guy in all of next-gen sequencing. By the way, sorry for the extended [...]]]></description>
			<content:encoded><![CDATA[<p>Now everyone will know about my secret past before I joined The Genome Center: <a href="http://www.bio-itworld.com/2009/09/16/NGS-dooling.html">David Dooling: Gangbusters at the Genome Center</a>. Bio-IT World also has a nice <a href="http://www.bio-itworld.com/NGS-Brown.html">interview with Clive Brown</a> of <a href="http://www.nanoporetech.com/">Oxford Nanopore</a>, whom <em>I</em> first described as the <a href="http://www.politigenomics.com/2009/08/another-rich-white-guy-sequences-own-genome.html">most honest guy in all of next-gen sequencing</a>.</p>
<p>By the way, sorry for the extended absence, things have been crazy.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.politigenomics.com/2009/09/my-secret-past.html/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Sour grapes</title>
		<link>http://www.politigenomics.com/2009/08/sour-grapes.html</link>
		<comments>http://www.politigenomics.com/2009/08/sour-grapes.html#comments</comments>
		<pubDate>Mon, 10 Aug 2009 15:13:13 +0000</pubDate>
		<dc:creator>dd</dc:creator>
				<category><![CDATA[genomics]]></category>
		<category><![CDATA[politics]]></category>
		<category><![CDATA[Illumina]]></category>
		<category><![CDATA[SOLiD]]></category>

		<guid isPermaLink="false">http://www.politigenomics.com/?p=1425</guid>
		<description><![CDATA[Well, the US is not the only place with interesting politics. I recently came across this letter from Kevin McKernan, Senior Director of Scientific Operations at Applied Biosystems/Life Technologies, to the House of Lords in the UK (pdf). In the letter, McKernan expresses his concern that the Sanger Institute&#8216;s decision to return their SOLiD instruments [...]]]></description>
			<content:encoded><![CDATA[<p>Well, the US is not the only place with interesting politics. I recently came across this <a href="http://www.parliament.uk/documents/upload/101stGMAppliedBiosystems.pdf">letter from Kevin McKernan, Senior Director of Scientific Operations at Applied Biosystems/Life Technologies, to the House of Lords in the UK (pdf)</a>. In the letter, McKernan expresses his concern that the <a href="http://www.sanger.ac.uk/">Sanger Institute</a>&#8216;s decision to <a href="http://www.genomeweb.com/sequencing/sanger-institute-returns-five-solids-life-technologies">return their SOLiD instruments</a> stemmed from some long-standing resentment of Applied Biosystems over its association with Craig Venter and his challenge to the <a href="http://www.genome.gov/10001772">Human Genome Project</a>. Obviously there could be no valid scientific reason for their actions. And clearly the House of Lords is in the best position to establish that fact and rectify the situation. Sure, the Sanger Institute receives its funding from the <a href="http://www.wellcome.ac.uk/">Wellcome Trust</a>, an <a href="http://www.wellcome.ac.uk/About-us/index.htm">independent charity</a>, but even if the House of Lords can&#8217;t pull their funding, they can always push an antitrust investigation, right?</p>
]]></content:encoded>
			<wfw:commentRss>http://www.politigenomics.com/2009/08/sour-grapes.html/feed</wfw:commentRss>
		<slash:comments>7</slash:comments>
		</item>
		<item>
		<title>Second whole cancer genome published</title>
		<link>http://www.politigenomics.com/2009/08/second-whole-cancer-genome-published.html</link>
		<comments>http://www.politigenomics.com/2009/08/second-whole-cancer-genome-published.html#comments</comments>
		<pubDate>Thu, 06 Aug 2009 15:26:33 +0000</pubDate>
		<dc:creator>dd</dc:creator>
				<category><![CDATA[genomics]]></category>
		<category><![CDATA[health]]></category>
		<category><![CDATA[Illumina]]></category>
		<category><![CDATA[science]]></category>
		<category><![CDATA[wustl]]></category>

		<guid isPermaLink="false">http://www.politigenomics.com/?p=1386</guid>
		<description><![CDATA[Today in the New England Journal of Medicine the second paper detailing the whole genome sequencing of a tumor and its matched normal was published: Recurring Mutations Found by Sequencing an Acute Myeloid Leukemia Genome. Accompanying the article is an editorial well worth reading by Jim Downing from St. Jude Children&#8217;s Research Hospital discussing the [...]]]></description>
			<content:encoded><![CDATA[<p>Today in the <a href="http://content.nejm.org/">New England Journal of Medicine</a> the second paper detailing the whole genome sequencing of a tumor and its matched normal was published: <a href="http://content.nejm.org/cgi/content/abstract/NEJMoa0903840v1">Recurring Mutations Found by Sequencing an Acute Myeloid Leukemia Genome</a>. Accompanying the article is an editorial well worth reading by <a href="http://www.stjude.org/stjude/v/index.jsp?vgnextoid=e61e10e88ce70110VgnVCM1000001e0215acRCRD&#038;vgnextchannel=7cc71436e3218010VgnVCM1000000e2015acRCRD">Jim Downing</a> from <a href="http://www.stjude.org/">St. Jude Children&#8217;s Research Hospital</a> discussing the significance of these <a href="http://content.nejm.org/cgi/content/full/NEJMe0906090">whole genome sequence efforts from the medical researcher and practitioner&#8217;s perspective</a>. Dan Koboldt has <a href="http://www.massgenomics.org/2009/08/second-cancer-genome-in-new-england-journal.html">posted a nice summary of the journal article</a>.</p>
<p>The most interesting finding from this research was the recurring mutation in a gene called <a href="http://www.ncbi.nlm.nih.gov/sites/entrez?db=gene&#038;cmd=Retrieve&#038;dopt=full_report&#038;list_uids=3417">IDH1</a>. The likelihood of the same mutation randomly occurring in 16% of similar AML patients is vanishingly small. In other words, it is quite likely that this mutation plays a role in <a href="http://en.wikipedia.org/wiki/Acute_myeloid_leukemia">AML</a> biology. Mutations in this gene have been <a href="http://content.nejm.org/cgi/content/abstract/360/8/765">previously described in glioblastomas</a> (brain cancer), where they were associated with improved outcome for patients. In contrast, after correcting for other mutations that are associated with better outcomes in AML, the IDH1 mutation is associated with poorer outcomes in AML. Thus, while IDH1 seems to play a role in <a href="http://en.wikipedia.org/wiki/Glioblastoma_multiforme">glioblastomas</a> and AML, its role in each may be quite different.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.politigenomics.com/2009/08/second-whole-cancer-genome-published.html/feed</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
	</channel>
</rss>

