<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>PolITiGenomics &#187; software</title>
	<atom:link href="http://www.politigenomics.com/tag/software/feed" rel="self" type="application/rss+xml" />
	<link>http://www.politigenomics.com</link>
	<description>Politics, Information Technology, and Genomics</description>
	<lastBuildDate>Thu, 21 Apr 2011 17:49:06 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.2.1</generator>
		<item>
		<title>Lightning strike</title>
		<link>http://www.politigenomics.com/2010/04/lightning-strike.html</link>
		<comments>http://www.politigenomics.com/2010/04/lightning-strike.html#comments</comments>
		<pubDate>Thu, 22 Apr 2010 02:25:32 +0000</pubDate>
		<dc:creator>dd</dc:creator>
				<category><![CDATA[genomics]]></category>
		<category><![CDATA[IT]]></category>
		<category><![CDATA[cloud]]></category>
		<category><![CDATA[compute]]></category>
		<category><![CDATA[informatics]]></category>
		<category><![CDATA[software]]></category>

		<guid isPermaLink="false">http://www.politigenomics.com/?p=2147</guid>
		<description><![CDATA[A previous cloud post, Puff piece, has gotten a bit of attention from Jason Stowe and Informatics Iron. While the Informatics Iron piece was positive, Mr. Stowe took issue with some of the points I made. First, he says that my claim that IT and software engineering is needed to get things running on the [...]]]></description>
			<content:encoded><![CDATA[<p>A previous cloud post, <a href="http://www.politigenomics.com/2010/02/puff-piece.html">Puff piece</a>, has gotten a bit of attention from <a href="http://blog.cyclecomputing.com/2010/02/follow-up-on-life-science-reader.html">Jason Stowe</a> and <a href="http://www.genomeweb.com/blog/zero-tolerance-policy-cloud-computing-balderdash">Informatics Iron</a>. While the Informatics Iron piece was positive, Mr. Stowe took <a href="http://www.politigenomics.com/2010/02/puff-piece.html/comment-page-1?comment-16466">issue with some of the points I made</a>. First, he says that my claim that IT and software engineering are needed to get things running on the cloud is inaccurate.<br />
<blockquote>You are implying that to get running in the cloud, an end user must worry about the “IT expertise” and “software engineering” needed to get applications up and running. I believe this is a straw-man, an incorrect assertion to begin with.</p>
<p>One of the major benefits of virtualized infrastructure and service oriented architectures is that they are repeatable and decouple the knowledge of building the service from the users consuming it. This means that one person, who creates the virtual machine images or the server code running the service, does need the expertise to get an application running properly in the cloud. But after that engineering is done once, a whole community of end-users of that service can benefit without knowledge of the specifics of getting the application to scale.</p>
<p>For example, does everyone that uses GMail/Yahoo/Hotmail know every line of software code to make it run? Do they know every operational aspect of how to make mail scale to tens of thousands of processors across many data centers?</p>
<p>Definitely not, and the point is they don’t have to. The same is true for high performance and high throughput computing. To give examples of free services that don’t require end user software engineering or IT expertise to do bioinformatics/proteomics/etc.:
<ul>
<li>The NIH Website for BLAST has, for years, been running BLAST as a service so that researchers can use GUIs to run queries on parallel back-end infrastructure (see http://www.ncbi.nlm.nih.gov/genome/seq/BlastGen/BlastGen.cgi?taxid=9606) This requires no complicated knowledge or software engineering for scientists to run BLAST as a Service.</li>
<li>Tools like ViPDAC have 2-minute tutorial videos to run proteomics on Amazon Web Service.</li>
</ul>
</blockquote>
<p> His argument is absolutely correct when dealing with established systems, applications, and work flows. For use cases like email and running BLAST, there is no need for additional software engineering or IT expertise (other than getting on the internet). In fact, The Genome Center has long offered a <a href="http://genome.wustl.edu/tools/blast">BLAST service</a> for anyone to use. Further, over the past few weeks, several prepackaged bioinformatics work flows that run on the cloud (or some approximation thereof) have been announced: Mr. Stowe&#8217;s company Cycle Computing announced <a href="http://www.cyclecomputing.com/news/28-newsitems/214-cycle-computing-launches-cyclecloud-for-life-sciences-product-family-at-xgen-sequencing">CycleCloud for Life Sciences</a>, <a href="http://www.genomequest.com/">GenomeQuest SDM</a>, <a href="http://www.cloudbiolinux.com/">Cloud Bio-Linux</a> from Bio-Team, ChIP-seq and RNA-seq analysis pipelines from <a href="https://dnanexus.com/">DNAnexus</a>, the work flows available in <a href="http://bitbucket.org/galaxy/galaxy-central/wiki/cloud">Galaxy</a>, and of course the previously published <a href="http://genomebiology.com/2009/10/11/R134">Crossbow</a>.  Unfortunately, canned analyses are not the norm in bioinformatics. Bioinformaticians love to tinker, trying to get just a little more biological information out of their data sets. The result is that bioinformatics applications and work flows are constantly being tweaked, updated, and improved. Because of this, maintenance of these pipelines is a huge burden. The supporters of these generic pipelines must work constantly to update and verify software or the users will constantly be waiting for the latest fix to be applied or latest feature to be available (anyone who installs each new version of <a href="http://www.ebi.ac.uk/~zerbino/velvet/">velvet</a> can attest to this). 
The saving grace in all of this is that as the use of sequencing becomes more widespread, the percentage of the people doing the analysis who classify as bioinformaticians will decrease (greatly). This means that a larger and larger percentage of people with sequence data to analyze will likely not be interested in tweaking analysis pipelines but will just want to run something and get an answer. It is this ever-growing group of people that will greatly benefit from easy-to-use analysis tools, whether they be deployed on the cloud or not. Both Mr. Stowe and I agree that creating easy-to-use tools for non-bioinformaticians is a very worthwhile goal. Unfortunately, the proliferation of existing tool options (e.g., maq, bwa, bowtie, bfast, soap, novoalign, etc.) now layered with a proliferation of cloud offerings will make it even more difficult for non-experts to choose which pipeline is the best to use. Therefore, approaches like those taken by Cycle Computing and GenomeQuest that provide default analysis pipelines <strong>and</strong> the ability for bioinformaticians to create and share their own work flows are the most likely to be successful. The development of these generic, distributed analysis frameworks that also provide useful defaults is an even more worthwhile goal because it achieves two important ends: ease of use for non-experts and the ability for bioinformaticians to tinker. Bioinformaticians are more likely to find tools like these useful and therefore will be early adopters, choose the best platforms, establish best practices on these platforms, publish results using these platforms, and <em>then</em> the non-experts will follow.</p>
<p>Mr. Stowe&#8217;s other objection related to my point that no process scales linearly with the number of cores. He concedes that point but notes<br />
<blockquote>In fact, regardless of whether the job is linearly scalable, most companies and research institutions don’t have 1 cluster to 1 user scenarios. There are multiple users with multiple jobs each. What if you have 10 crossbow users with 10 runs to do on various genomes? Then you can get 100x performance on the *workflow as a whole*.</p></blockquote>
<p> Again, this is true, but, to be fair, that is not the same point he made in his original article. His original point was that if <em>you</em> needed <em>your</em> analysis to run faster, you could just provision more nodes. I just pointed out that this is true, but <em>you</em> would likely pay a premium for that because <strong>nothing</strong> scales linearly. It may seem like a fine distinction, but with all the misinformation around clouds nowadays, it&#8217;s an important one to make. It should also be noted that without good software engineering and systems administration, even algorithms that should scale nearly linearly might not. The take-home message is that if someone has done that software engineering and systems administration work to make a program scale well and run well in a cloud environment and made it available to you, great. If not, someone is going to have to do it.</p>
<p>I had the opportunity to meet Mr. Stowe at the XGen Congress and have talked more with him this week at <a href="http://www.bio-itworldexpo.com/">Bio-IT World Conference and Expo</a> (my talk is tomorrow at 11 a.m. EDT in <a href="http://www.bio-itworldexpo.com/Bio-It_Expo_Content.aspx?id=94894">Track 3: Bioinformatics and Next-Gen Data</a>). We had a good discussion about cloud computing and its role in bioinformatics (they&#8217;ve got a cool solution to the Amazon storage problem). As you can hopefully tell from this post, we are largely in agreement: engineering is needed, but once it is done, everyone benefits. <a href="http://www.cyclecomputing.com/">Cycle Computing</a> certainly has a lot of good expertise in the cloud, so if you need some engineering done, shoot him an email. Unfortunately, they probably will not be able to help you access the <a href="http://www.networkworld.com/community/node/58829">largest cloud computing service</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.politigenomics.com/2010/04/lightning-strike.html/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Me, in podcast form</title>
		<link>http://www.politigenomics.com/2010/02/me-in-podcast-form.html</link>
		<comments>http://www.politigenomics.com/2010/02/me-in-podcast-form.html#comments</comments>
		<pubDate>Wed, 24 Feb 2010 14:40:17 +0000</pubDate>
		<dc:creator>dd</dc:creator>
				<category><![CDATA[genomics]]></category>
		<category><![CDATA[IT]]></category>
		<category><![CDATA[informatics]]></category>
		<category><![CDATA[software]]></category>
		<category><![CDATA[wustl]]></category>

		<guid isPermaLink="false">http://www.politigenomics.com/?p=2158</guid>
		<description><![CDATA[I recently did an interview in advance of my talk at the XGen Congress next month in San Diego. The interview is about 14 minutes and discusses our work at The Genome Center in general and more specifically the software and IT infrastructure we have created to enable the analysis of the massive amounts of [...]]]></description>
			<content:encoded><![CDATA[<p>I recently did an interview in advance of my talk at the <a href="http://www.healthtech.com/xgn">XGen Congress</a> next month in San Diego. The interview is about 14 minutes and discusses our work at <a href="http://genome.wustl.edu/">The Genome Center</a> in general and more specifically the software and IT infrastructure we have created to enable the analysis of the massive amounts of sequence data we generate. The interview is available to download as part of the <a href="http://www.healthtech.com/Conferences_Overview.aspx?ekfrm=97046">XGen Congress podcast series</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.politigenomics.com/2010/02/me-in-podcast-form.html/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Puff piece</title>
		<link>http://www.politigenomics.com/2010/02/puff-piece.html</link>
		<comments>http://www.politigenomics.com/2010/02/puff-piece.html#comments</comments>
		<pubDate>Tue, 16 Feb 2010 22:19:24 +0000</pubDate>
		<dc:creator>dd</dc:creator>
				<category><![CDATA[genomics]]></category>
		<category><![CDATA[IT]]></category>
		<category><![CDATA[cloud]]></category>
		<category><![CDATA[informatics]]></category>
		<category><![CDATA[software]]></category>

		<guid isPermaLink="false">http://www.politigenomics.com/?p=2106</guid>
		<description><![CDATA[Why should one be skeptical of all the information touting the wonders of cloud computing? This older, in-depth piece by Gartner, Hype Cycle for Cloud Computing, 2009, lays out the reasons pretty well. But one need not spend that much time reading about it. You can simply read this much shorter piece by Jason Stowe: [...]]]></description>
			<content:encoded><![CDATA[<p>Why should one be skeptical of all the information touting the wonders of cloud computing? This older, in-depth piece by Gartner, <a href="http://www.gartner.com/technology/media-products/reprints/bmc/article12/article12.html">Hype Cycle for Cloud Computing, 2009</a>, lays out the reasons pretty well. But one need not spend that much time reading about it. You can simply read this much shorter piece by Jason Stowe: <a href="http://www.lifescienceleader.com/index.php?option=com_jambozine&#038;layout=article&#038;view=page&#038;aid=3973">Is the Future Of High- Performance Computing For Life Sciences Cloudy?</a> Reading that story, one can only get the impression that the cloud is some panacea where all computational problems are solved. In fact, the picture is so rosy that one may become suspicious. So suspicious that one may read the <em>About the Author</em> section at the bottom of the piece and see that Mr. Stowe happens to be CEO of a company selling cloud computing services.<br />
<blockquote>Jason Stowe is the founder and CEO of Cycle Computing, a provider of high-performance computing (HPC) and open source technology in the cloud. A seasoned entrepreneur and experienced technologist, Jason attended Carnegie Mellon and Cornell Universities.</p></blockquote>
<p> No wonder he makes cloud computing sound so attractive. No mention of the IT expertise needed to get up and running on the cloud. No mention of the software engineering needed to ensure your programs run efficiently on the cloud. It may not be apparent from his article, but a program that runs well on one or ten computers does not necessarily run well on hundreds of computers. In fact, he implies the exact opposite.<br />
<blockquote>For compute clusters as a service, the math is different: Having 40 processors work for 100 hours costs the same as having 1,000 processors run for 4 hours.</p></blockquote>
<p> It may cost the same under that scenario, but not everything scales linearly. In fact, most things don&#8217;t, and that less-than-linear scaling actually ends up making it cost <em>more</em> to get a shorter turnaround. This fact was clearly evident in the <a href="http://genomebiology.com/2009/10/11/R134">Crossbow paper</a>, where it cost $52 to complete the analysis in 6.5 hours but $84 to finish it in under 3 hours (Table 4). The article fails to mention this, a marvel given that the lack of good, scalable bioinformatics tools that can run well in highly parallel environments is perhaps the largest impediment to the adoption of cloud computing in bioinformatics. Of course, I am sure he will gladly sell you consulting services that will get you up and running on the cloud. In short, this looks like a shill.</p>
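<p>To make that premium concrete, here is a minimal Python sketch that uses only the two cost and time figures quoted above from Table 4 of the Crossbow paper. The variable and function names are illustrative. Because the faster run is described only as finishing in under 3 hours, 3.0 hours is assumed as an upper bound, so the computed speedup is a lower bound.</p>
<pre>
```python
# Cost and wall-clock figures quoted in the post from Table 4 of the Crossbow paper.
SLOW_COST_USD, SLOW_HOURS = 52.0, 6.5
FAST_COST_USD = 84.0
FAST_HOURS = 3.0  # assumption: the post says "under 3 hours", so this is an upper bound


def cost_premium(slow_cost, fast_cost):
    """Fractional extra cost paid for the faster run (0.0 means no premium)."""
    return fast_cost / slow_cost - 1.0


def speedup(slow_hours, fast_hours):
    """How many times faster the shorter run finished."""
    return slow_hours / fast_hours


premium = cost_premium(SLOW_COST_USD, FAST_COST_USD)
gain = speedup(SLOW_HOURS, FAST_HOURS)
print(f"Speedup: at least {gain:.2f}x")
print(f"Cost premium: {premium:.0%}")
```
</pre>
<p>With perfectly linear scaling the premium would be zero. With these figures, the faster run costs roughly 60% more for a little over twice the speed.</p>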
<p>Unfortunately, omitting information is not the only problem with many of the stories about cloud computing; many also contain misinformation. For example, the story <a href="http://www.nature.com/nbt/journal/v28/n1/full/nbt0110-1.html">Gathering clouds and a sequencing storm</a> in Nature Biotechnology mentions the software engineering challenges but erroneously states<br />
<blockquote>&hellip;bioinformaticians might not be willing to spend the time to familiarize themselves with hadoop, the open source program needed to process large data sets on a cloud</p></blockquote>
<p> What?!? You do not <em>have</em> to develop tools using <a href="http://hadoop.apache.org/">Hadoop</a>. Sure, it is a nice platform that provides fault-tolerant parallelism, but it is by no means required by any cloud provider that I know of (not even Google, whose MapReduce framework provided the model for Hadoop!) nor is it the only way to achieve parallel processing (far from it). Amazon EC2 just provides you with a virtual machine with a basic operating system installed on it and remote access. You can do whatever you want with it after that. Google and Microsoft do require that you develop your code in their cloud framework, but you do not have to use Hadoop. For information on what you <em>do</em> have to do to run jobs on the major cloud providers, check out this article by Udayan Banerjee, <a href="http://cloudcomputing.sys-con.com/node/1257999">Cloud Economics &mdash; Amazon, Microsoft, Google Compared</a>, and each provider&#8217;s web site: <a href="http://aws.amazon.com/">Amazon AWS</a>, <a href="http://code.google.com/appengine/">Google App Engine</a>, and <a href="http://www.microsoft.com/windowsazure/windowsazure/">Microsoft Windows Azure</a>.</p>
<p><em>(How many bad cloud puns can I work into post titles? Stay tuned.)</em></p>
]]></content:encoded>
			<wfw:commentRss>http://www.politigenomics.com/2010/02/puff-piece.html/feed</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Bioinformatics and cloud computing</title>
		<link>http://www.politigenomics.com/2009/11/bioinformatics-and-cloud-computing.html</link>
		<comments>http://www.politigenomics.com/2009/11/bioinformatics-and-cloud-computing.html#comments</comments>
		<pubDate>Tue, 24 Nov 2009 19:54:22 +0000</pubDate>
		<dc:creator>dd</dc:creator>
				<category><![CDATA[genomics]]></category>
		<category><![CDATA[IT]]></category>
		<category><![CDATA[cloud]]></category>
		<category><![CDATA[compute]]></category>
		<category><![CDATA[FLOSS]]></category>
		<category><![CDATA[Illumina]]></category>
		<category><![CDATA[informatics]]></category>
		<category><![CDATA[software]]></category>

		<guid isPermaLink="false">http://www.politigenomics.com/?p=1728</guid>
		<description><![CDATA[From the Using clouds for parallel computations in systems biology workshop at the recent SC09 conference (Informatics Iron writeup) to last month&#8217;s Genome Informatics meeting, everyone in bioinformatics is talking about cloud computing these days. Last week Steven Salzberg&#8216;s group published a paper on their Crossbow tool entitled Searching for SNPs with cloud computing (Cloudera [...]]]></description>
			<content:encoded><![CDATA[<p>From the <a href="http://www.mcs.anl.gov/events/workshops/sc09-sysbio/index.php">Using clouds for parallel computations in systems biology</a> workshop at the recent <a href="http://sc09.supercomputing.org/">SC09 conference</a> (<a href="http://www.genomeweb.com/blog/cloud-bio-computing-sc09">Informatics Iron writeup</a>) to last month&#8217;s <a href="http://www.genomeweb.com/informatics/genome-informatics-speakers-say-second-gen-sequencing-makes-giddy-times-bioinfor">Genome Informatics meeting</a>, everyone in bioinformatics is talking about cloud computing these days. Last week <a href="http://genome.fieldofscience.com/">Steven Salzberg</a>&#8217;s <a href="http://www.cbcb.umd.edu/~salzberg/">group</a> published a paper on their Crossbow tool entitled <a href="http://genomebiology.com/2009/10/11/R134">Searching for SNPs with cloud computing</a> (<a href="http://www.cloudera.com/blog/2009/10/15/analyzing-human-genomes-with-hadoop/">Cloudera blog post on Crossbow</a>). In the paper, the authors describe how they were able to analyze the human sequence data <a href="http://www.nature.com/nature/journal/v456/n7218/abs/nature07484.html">published last year by BGI</a> using <a href="http://aws.amazon.com/ec2/">Amazon EC2</a>.  Specifically, they have developed an alignment (<a href="http://bowtie-bio.sourceforge.net/index.shtml">bowtie</a>) and SNP detection (<a href="http://soap.genomics.org.cn/soapsnp.html">SoapSNP</a>) pipeline that is executed in parallel across a cluster using the <a href="http://hadoop.apache.org/">Hadoop</a> framework (a <a href="http://fsf.org/">free software</a> implementation of <a href="http://labs.google.com/papers/mapreduce.html">Google&#8217;s MapReduce</a> framework).  Using a 40-node, 320-core EC2 cluster, they were able to analyze 38&times; coverage sequence data in about three hours. 
The whole analysis, including data transfer and storage on <a href="http://aws.amazon.com/s3/">Amazon S3</a>, cost about $125. You can find a more detailed cost breakdown and comparison on Gary Stiehr&#8217;s <a href="http://hpcinfo.com/2009/11/22/benchmarking-the-cloud-for-genomics/">HPCInfo post</a> and more detail on the SNP detection on Dan Koboldt&#8217;s <a href="http://www.massgenomics.org/2009/11/crossbow-ngs-informatics-in-the-cloud.html">Mass Genomics post</a>.</p>
<p>For analyzing a single genome, you really can&#8217;t beat that price.  Of course, at the rate next-generation sequencing instruments are generating data, most people are not going to want to analyze just one genome. So the question becomes, what is the break-even point? That is, how many genomes do you have to sequence to make buying compute resources cheaper than renting them from Amazon? We currently estimate that the fully loaded (node, chassis, rack, networking, etc.) cost of a single computational core is about $500. Thus, to purchase 320 cores would cost you about $160,000.  It&#8217;s going to take a lot of genomes (1,280) to hit that break-even point. But do you really need to analyze a genome in three hours? With the current per-run throughput of a single Illumina GA IIx, it would take about four ten-day runs (40 days) to generate 38&times; coverage of a human genome. After each run, you could align the sequence data from that run. Each lane of data would take 8-12 core&middot;hours to align, so a whole run&#8217;s (eight lanes&#8217;) worth of data would take about 80 core&middot;hours. Therefore, even if you had just one core, you could align all the data before the next run completed. The consensus calling and variant detection portions of the pipeline typically take a handful of core&middot;hours and therefore do not change the economics; they too can be completed before the first run of the next genome is completed. Thus, with a $500 investment in computational resources, you can more than keep pace with the Illumina instrument. Note that I am completely excluding the cost of storage, as that will be needed for the data and results regardless of where the computation is done. Of course, you probably wouldn&#8217;t buy just <em>one</em> core. 
Checking over at the <a href="http://www.dell.com/us/en/highered/df.aspx?refid=df&#038;s=hied&#038;cs=RC956904&#038;~ck=mn">Dell Higher Education web site</a>, you can get a Quad Core Precision T3500n with 4 GiB of RAM (more RAM per core than the <a href="http://aws.amazon.com/ec2/#instance">Amazon EC2 Extra Large Instance</a> used in the paper) and 750 GB local storage capacity (about the same storage per core as the Extra Large Instance) for $1700. You would need less than one core&#8217;s worth (25%) of that workstation&#8217;s capacity dedicated to alignment of and variant detection on data from a single Illumina GA IIx (thanks to <a href="http://en.wikipedia.org/wiki/Burrows-Wheeler_transform">Burrows-Wheeler Transform</a> aligners like bowtie and <a href="http://bio-bwa.sourceforge.net/">bwa</a>). Using the single core numbers, the break-even point for purchase versus cloud is less than five whole genomes. Using the entire cost of the Dell workstation (even though you require less than 25% of its computational capacity), the break-even point is about 14 genomes. It would take about 1.5 years (about half the expected life of IT hardware) at current throughput to sequence 14 genomes with a single Illumina GA IIx. At data rates expected in January 2010, it would take less than a year to break even.</p>
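<p>The break-even arithmetic above can be reproduced with a short Python sketch. It uses only the prices quoted in this post, the function and variable names are illustrative, and it assumes, as the estimates above do, that the purchase price is the only ownership cost and that storage is excluded on both sides.</p>
<pre>
```python
# Prices quoted in the post (2009); names are illustrative, not from any library.
CLOUD_COST_PER_GENOME = 125.0  # Crossbow run on EC2, including transfer and S3 storage
COST_PER_CORE = 500.0          # estimated fully loaded cost of one purchased core
WORKSTATION_COST = 1700.0      # quad-core Dell Precision T3500n


def break_even_genomes(hardware_cost, cloud_cost_per_genome=CLOUD_COST_PER_GENOME):
    """Number of genomes at which buying hardware costs the same as renting."""
    return hardware_cost / cloud_cost_per_genome


scenarios = {
    "320 purchased cores": 320 * COST_PER_CORE,
    "one purchased core": COST_PER_CORE,
    "whole workstation": WORKSTATION_COST,
}
for label, cost in scenarios.items():
    print(f"{label}: ${cost:,.0f} -> {break_even_genomes(cost):.1f} genomes")
```
</pre>
<p>The three scenarios come out at 1,280, 4, and 13.6 genomes, which is where the &#8220;less than five&#8221; and &#8220;about 14&#8221; figures come from.</p>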
<p>These numbers indicate that unless you are just sequencing a few genomes, you are probably better off purchasing a (possibly single node) cluster. With the proliferation of sequencing applications and publications in the last couple of years, not many researchers will fall into the &#8220;few genomes&#8221; bin. Our experience has been that the more sequencing data people get, the more they want. Another way to look at this is that the entire computational hardware cost for analysis (&lt;$1700) is less than 1% of the sequencing instrument cost; or the computational cost to analyze a whole genome (&lt;$500) is less than 1% of the total data generation costs (reagents, flow cells, instrument depreciation, technician time, etc.). None of this is to say that there is no place for cloud and other distributed computing frameworks in bioinformatics, but that&#8217;s the topic of a future post.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.politigenomics.com/2009/11/bioinformatics-and-cloud-computing.html/feed</wfw:commentRss>
		<slash:comments>9</slash:comments>
		</item>
		<item>
		<title>OSCON lead up</title>
		<link>http://www.politigenomics.com/2009/07/oscon-lead-up.html</link>
		<comments>http://www.politigenomics.com/2009/07/oscon-lead-up.html#comments</comments>
		<pubDate>Mon, 13 Jul 2009 17:49:14 +0000</pubDate>
		<dc:creator>dd</dc:creator>
				<category><![CDATA[genomics]]></category>
		<category><![CDATA[IT]]></category>
		<category><![CDATA[FLOSS]]></category>
		<category><![CDATA[health]]></category>
		<category><![CDATA[informatics]]></category>
		<category><![CDATA[OSCON]]></category>
		<category><![CDATA[science]]></category>
		<category><![CDATA[software]]></category>
		<category><![CDATA[wustl]]></category>

		<guid isPermaLink="false">http://www.politigenomics.com/?p=1323</guid>
		<description><![CDATA[Last week I did an interview with James Turner at O&#8217;Reilly about my upcoming talk at OSCON. It turns out that James is a bit of a genomics nut and therefore had a lot of insightful questions about the current state of genomics and health. Hopefully my responses were as good as his questions. You [...]]]></description>
			<content:encoded><![CDATA[<p>Last week I did an interview with <a href="http://radar.oreilly.com/jamest/">James Turner</a> at <a href="http://oreilly.com/">O&#8217;Reilly</a> about my <a href="http://www.politigenomics.com/2009/04/oscon-2009.html">upcoming talk at OSCON</a>. It turns out that James is a bit of a genomics nut and therefore had a lot of insightful questions about the current state of genomics and health. Hopefully my responses were as good as his questions. You can judge for yourself by listening to the interview or reading the transcript: <a href="http://radar.oreilly.com/2009/07/sequencing-a-genome-a-week.html">Sequencing a Genome a Week</a>.</p>
<p><strong>Update:</strong> The story has been posted on <a href="http://science.slashdot.org/story/09/07/13/2129229/Sequencing-a-Human-Genome-In-a-Week">Slashdot</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.politigenomics.com/2009/07/oscon-lead-up.html/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>VarScan published</title>
		<link>http://www.politigenomics.com/2009/06/varscan-published.html</link>
		<comments>http://www.politigenomics.com/2009/06/varscan-published.html#comments</comments>
		<pubDate>Tue, 23 Jun 2009 13:04:19 +0000</pubDate>
		<dc:creator>dd</dc:creator>
				<category><![CDATA[genomics]]></category>
		<category><![CDATA[454]]></category>
		<category><![CDATA[Illumina]]></category>
		<category><![CDATA[informatics]]></category>
		<category><![CDATA[science]]></category>
		<category><![CDATA[software]]></category>
		<category><![CDATA[wustl]]></category>

		<guid isPermaLink="false">http://www.politigenomics.com/?p=1294</guid>
		<description><![CDATA[VarScan, a tool developed at The Genome Center to detect variants in massively parallel sequence data, has been published in Bioinformatics. VarScan can process both 454 and Solexa data from individuals or pools. You can find more information about VarScan in a post by Dan Koboldt, one of the paper&#8217;s and VarScan&#8217;s authors.]]></description>
			<content:encoded><![CDATA[<p><a href="http://genome.wustl.edu/tools/cancer-genomics#varscan">VarScan</a>, a tool developed at <a href="http://genome.wustl.edu/">The Genome Center</a> to detect variants in massively parallel sequence data, has been published in <a href="http://bioinformatics.oxfordjournals.org/cgi/content/abstract/btp373">Bioinformatics</a>. VarScan can process both 454 and Solexa data from individuals or pools. You can find more information about <a href="http://www.massgenomics.org/2009/06/variant-detection-in-massively-parallel-sequencing.html">VarScan in a post by Dan Koboldt</a>, one of the paper&#8217;s and VarScan&#8217;s authors.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.politigenomics.com/2009/06/varscan-published.html/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>UR explained</title>
		<link>http://www.politigenomics.com/2009/06/ur-explained.html</link>
		<comments>http://www.politigenomics.com/2009/06/ur-explained.html#comments</comments>
		<pubDate>Fri, 19 Jun 2009 15:54:51 +0000</pubDate>
		<dc:creator>dd</dc:creator>
				<category><![CDATA[genomics]]></category>
		<category><![CDATA[informatics]]></category>
		<category><![CDATA[software]]></category>
		<category><![CDATA[wustl]]></category>

		<guid isPermaLink="false">http://www.politigenomics.com/?p=1285</guid>
		<description><![CDATA[Tony Brummet from The Genome Center gave a presentation earlier this week at the St. Louis Perl Mongers meeting on UR. The kind folks at StL.pm have posted the videos for the geographically challenged to enjoy. Part 1 Part 2 You can find a PDF of the slide deck in the Files section of the [...]]]></description>
			<content:encoded><![CDATA[<p>Tony Brummet from <a href="http://genome.wustl.edu/">The Genome Center</a> gave a presentation earlier this week at the <a href="http://stlouisperlmongers.blogspot.com/">St. Louis Perl Mongers</a> meeting on <a href="http://www.politigenomics.com/2009/05/ur-so-beautiful-to-me.html">UR</a>. The kind folks at StL.pm have <a href="http://stlouisperlmongers.blogspot.com/2009/06/videos-from-ur-presentation.html">posted the videos</a> for the geographically challenged to enjoy.</p>
<div class="widevideo"><embed src="http://blip.tv/play/AYGK2F+V9Dw" type="application/x-shockwave-flash" width="500" height="297" allowscriptaccess="always" allowfullscreen="true"></embed>Part 1</div>
<div class="widevideo"><embed src="http://blip.tv/play/AYGK2iiV9Dw" type="application/x-shockwave-flash" width="500" height="297" allowscriptaccess="always" allowfullscreen="true"></embed>Part 2</div>
<p>You can find a PDF of the slide deck in the <a href="http://groups.google.com/group/stl-pm/files">Files section of the StL.pm Google Group page</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.politigenomics.com/2009/06/ur-explained.html/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Illumina cluster needs</title>
		<link>http://www.politigenomics.com/2009/06/illumina-cluster-needs.html</link>
		<comments>http://www.politigenomics.com/2009/06/illumina-cluster-needs.html#comments</comments>
		<pubDate>Thu, 18 Jun 2009 16:13:36 +0000</pubDate>
		<dc:creator>dd</dc:creator>
				<category><![CDATA[genomics]]></category>
		<category><![CDATA[IT]]></category>
		<category><![CDATA[compute]]></category>
		<category><![CDATA[FLOSS]]></category>
		<category><![CDATA[Illumina]]></category>
		<category><![CDATA[LSF]]></category>
		<category><![CDATA[software]]></category>

		<guid isPermaLink="false">http://www.politigenomics.com/?p=1254</guid>
		<description><![CDATA[There is an interesting thread over at the Solexa Google Group about the IT infrastructure needed to support an Illumina Genome Analyzer (GA). The discussion focuses mostly on the cluster and, to a lesser extent, the storage and network required to operate the instrument and generate sequence data (primary analysis). At The Genome Center, we [...]]]></description>
			<content:encoded><![CDATA[<p>There is an interesting thread over at the <a href="http://groups.google.com/group/solexa?hl=en">Solexa Google Group</a> about the <a href="http://groups.google.com/group/solexa/browse_thread/thread/38ff88dcf5f5df17?hl=en">IT infrastructure needed to support an Illumina Genome Analyzer (GA)</a>. The discussion focuses mostly on the cluster and, to a lesser extent, the storage and network required to operate the instrument and generate sequence data (primary analysis). At <a href="http://genome.wustl.edu/">The Genome Center</a>, we use Platform LSF HPC as our batch scheduler and currently use <a href="http://www.politigenomics.com/2008/03/illumina-genome-analyzer-pipeline-and.html">lsgmake-gap</a> to execute the GAPipeline (the Illumina primary analysis software) in parallel on our cluster. However, GAPipeline is developed and tested by Illumina on a cluster managed by <a href="http://www.sun.com/software/sge/">Sun Grid Engine (SGE)</a>, which is <a href="http://gridengine.sunsource.net/">free/open source software</a>. This situation creates some headaches for us because, as the internals of GAPipeline change, we need to <a href="http://www.politigenomics.com/2009/02/lsgmake-gap-for-gapipeline-13.html">regularly update lsgmake-gap</a> so that GAPipeline will continue to run properly on our cluster. When we migrated to LSF several years ago, the driving force behind the choice was that it was the only batch scheduler we tested that could handle scheduling 50,000+ jobs at a time (a regular occurrence on our cluster). Fortunately, users may no longer have to choose between scalability and ease of use when running GAPipeline as part of their larger computational needs. Chris Dagdigian, who writes the <a href="http://gridengine.info/">gridengine.info blog</a>, shared this information about the current capabilities of SGE:</p>
<blockquote><ol>
<li>SGE 6.2 design goal includes supporting a single array job with 500,000 tasks and hundreds of thousands of concurrent jobs</li>
<li>People have been running hundreds of thousands of SGE jobs per week since the SGE 5.3 days many years ago</li>
<li>I personally know of several sites pushing hundreds of thousands of heavy SGE jobs per week through their systems right now</li>
<li>SGE 6.2 runs a 62,000 core cluster in Texas (RANGER) and has been for some time</li>
</ol>
<p>&#8220;tens of thousands of jobs&#8221; is actually pretty easy with Grid Engine and has been for some time, scaling issues encountered in this range have more to do with bad spooling decisions, filesystem design and occasionally an overwhelmed qmaster host. The developers have worked quite a bit this year to improve threading performance, reduce memory footprints and remove things like external RSH methods that consumed system resources like filehandles and TCP ports etc.</p>
<p>This is especially evident in the SGE 6.2  and 6.2u1 release series where speed and scaling were specifically addressed as part of the design effort (6.2u3 and 6.3 will introduce new features). This is the reason why the <em>SGE scheduler is now a thread within the qmaster</em> &#8211; one of the more obvious user-visible changes made recently. (emphasis mine &#8211; dd)</p>
<p>There are many reasons why one would chose between LSF vs SGE (I have used both for years now) but scaling is not one of the significant selection factors. Features, price, APIs and quality of documentation are far more important along with community adoption/support.</p>
</blockquote>
<p>I would guess that breaking out the scheduler into its own thread is a major factor in SGE&#8217;s ability to manage so many jobs. Handling that many jobs was the major deficiency of SGE and other batch schedulers we tested at the time. Several systems designed their schedulers to automatically run through the list of jobs at certain intervals. With a lot of jobs in the queue, the scheduler would not finish its previous traversal before the new one was scheduled to start. Depending on the implementation, this meant either that the original traversal was killed and the scheduler never processed some jobs, or that scheduler threads kept spawning until the resources on the master node were exhausted (that&#8217;s bad).</p>
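<p>To make the two failure modes described above concrete, here is a minimal Python sketch of a fixed-interval scheduler. It is a toy model, not how SGE, LSF, or any other real scheduler is implemented: the function name <code>traversal_overrun</code> is hypothetical, and the 2 ms per job and 15 second interval are made-up numbers chosen only to show the effect at the 50,000-job scale mentioned earlier.</p>
<pre>
```python
def traversal_overrun(num_jobs, ms_per_job, interval_ms):
    """Toy model of a scheduler that walks the whole queue every interval_ms.

    All three parameters are illustrative assumptions, not measurements
    from SGE, LSF, or any other real scheduler.
    """
    # Time needed for one complete walk of the job queue.
    pass_ms = num_jobs * ms_per_job

    # Failure mode 1: an unfinished walk is killed when the next one starts,
    # so only the jobs reachable within one interval are ever examined.
    reached = min(num_jobs, interval_ms // ms_per_job)

    # Failure mode 2: unfinished walks are left running, so they pile up.
    # Ceiling division gives the number of walks alive at once in steady state.
    concurrent = max(1, -(-pass_ms // interval_ms))

    return {
        "pass_ms": pass_ms,
        "never_scheduled": num_jobs - reached,
        "concurrent_passes": concurrent,
    }


if __name__ == "__main__":
    # 50,000 queued jobs (the scale mentioned above), with assumed costs.
    print(traversal_overrun(num_jobs=50000, ms_per_job=2, interval_ms=15000))
```
</pre>
<p>Under these assumed numbers a full walk takes 100 seconds, far longer than the 15 second interval, so either most of the queue is never reached or several walks end up running at once. With a short queue neither problem appears, which is why this kind of design can look fine in small-scale testing.</p>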
<p>(A couple of asides here: since GAPipeline is built on Makefiles, another option that came up in the thread was parallel execution across an LSF cluster using <a href="http://distmake.sourceforge.net/pmwiki/pmwiki.php">distmake</a>. Also, because of <a href="http://hpcinfo.com/">our interest</a> in <a href="http://www.opensciencegrid.org/">grid computing</a>, we are currently investigating replacing LSF with <a href="http://www.cs.wisc.edu/condor/">Condor</a>.)</p>
<p>Of course, with the rollout of SCS2.4 with RTA (real-time analysis), most of the primary analysis is now done on the instrument control computer. Thus, all of this talk about the requirements to produce sequence from the machine is much less important. Only one stage of the pipeline, the alignment and reporting (called GERALD), is now run off the instrument computer. The most computationally intensive part of this stage is the alignment (ELAND and its post-processing), and it can only be made parallel on a per-lane basis, i.e., eight ways.</p>
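<p>The practical consequence of per-lane parallelism can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name <code>alignment_wall_minutes</code> is hypothetical, the 60 minutes per lane is a made-up figure rather than a measured ELAND run time, and each lane is treated as a single task that cannot be split further.</p>
<pre>
```python
def alignment_wall_minutes(lane_minutes, cores):
    """Estimate wall-clock time when each lane is one indivisible task.

    lane_minutes: per-lane alignment times (hypothetical values).
    cores: number of cores available; must be at least 1.
    """
    # Extra cores beyond the number of lanes have nothing to do.
    busy = [0] * min(cores, len(lane_minutes))
    # Greedy assignment: hand the longest remaining lane to the least-loaded core.
    for minutes in sorted(lane_minutes, reverse=True):
        busy[busy.index(min(busy))] += minutes
    return max(busy)


if __name__ == "__main__":
    lanes = [60] * 8  # eight lanes, assumed 60 minutes each
    for cores in (1, 4, 8, 64):
        print(cores, alignment_wall_minutes(lanes, cores))
```
</pre>
<p>With eight equal lanes, the estimate stops improving at eight cores: going from 8 to 64 cores changes nothing, because there are only eight units of work to hand out.</p>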
<p>Of course, there is also the specter of the requirements for sequence analysis at Illumina GA IIx scale, but that&#8217;s a whole other post&hellip;</p>
]]></content:encoded>
			<wfw:commentRss>http://www.politigenomics.com/2009/06/illumina-cluster-needs.html/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Learning opportunities</title>
		<link>http://www.politigenomics.com/2009/06/learning-opportunities.html</link>
		<comments>http://www.politigenomics.com/2009/06/learning-opportunities.html#comments</comments>
		<pubDate>Wed, 17 Jun 2009 21:28:07 +0000</pubDate>
		<dc:creator>dd</dc:creator>
				<category><![CDATA[IT]]></category>
		<category><![CDATA[FLOSS]]></category>
		<category><![CDATA[software]]></category>

		<guid isPermaLink="false">http://www.politigenomics.com/?p=1252</guid>
		<description><![CDATA[These links came to my attention this past weekend and I thought they might be of use to some of the readers here. First, you can access all course materials, even lectures, for the CS61A: Structure and Interpretation of Computer Programs course at UC Berkeley. The course comes highly recommended. Second, Melissa Kahney has aggregated [...]]]></description>
			<content:encoded><![CDATA[<p>These links came to my attention this past weekend and I thought they might be of use to some of the readers here. First, you can access all course materials, even lectures, for the <a href="http://inst.eecs.berkeley.edu/~cs61a/sp08/">CS61A: Structure and Interpretation of Computer Programs</a> course at UC Berkeley. The course comes highly recommended. Second, Melissa Kahney has aggregated links for a bunch of <a href="http://educhoices.org/articles/Useful_Tutorials_on_Linux_and_UNIX_for_Beginners_and_Experts_Alike.html">UNIX and GNU/Linux tutorials</a> grouped by topic and target audience (beginner and expert).</p>
]]></content:encoded>
			<wfw:commentRss>http://www.politigenomics.com/2009/06/learning-opportunities.html/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Great Expectations</title>
		<link>http://www.politigenomics.com/2009/06/great-expectations.html</link>
		<comments>http://www.politigenomics.com/2009/06/great-expectations.html#comments</comments>
		<pubDate>Mon, 15 Jun 2009 14:15:38 +0000</pubDate>
		<dc:creator>dd</dc:creator>
				<category><![CDATA[genomics]]></category>
		<category><![CDATA[IT]]></category>
		<category><![CDATA[FLOSS]]></category>
		<category><![CDATA[health]]></category>
		<category><![CDATA[OSCON]]></category>
		<category><![CDATA[science]]></category>
		<category><![CDATA[software]]></category>
		<category><![CDATA[wustl]]></category>

		<guid isPermaLink="false">http://www.politigenomics.com/?p=1246</guid>
		<description><![CDATA[A colleague of mine at The Genome Center pointed me to this O&#8217;Reilly Radar blog post about the talks at OSCON 2009 that Allison Randal, one of the organizers, considers highlights. Very kindly, she mentions my talk, The Freedom to Cure Cancer. I have a rough outline of the talk clanging around in my head. [...]]]></description>
			<content:encoded><![CDATA[<p>A colleague of mine at <a href="http://genome.wustl.edu/">The Genome Center</a> pointed me to this <a href="http://radar.oreilly.com/2009/06/oscon-2009-highlights-and-earl.html">O&#8217;Reilly Radar blog post about the talks at OSCON 2009 that Allison Randal, one of the organizers, considers highlights</a>. Very kindly, she mentions my talk, <a href="http://en.oreilly.com/oscon2009/public/schedule/detail/7985">The Freedom to Cure Cancer</a>. I have a rough outline of the talk clanging around in my head. Having it take shape on a slide deck is going to take some work (and a lot of time on Google image search). Hopefully, the talk will live up to the hype.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.politigenomics.com/2009/06/great-expectations.html/feed</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
	</channel>
</rss>

