<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
		>
<channel>
	<title>Comments on: Bioinformatics and cloud computing</title>
	<atom:link href="http://www.politigenomics.com/2009/11/bioinformatics-and-cloud-computing.html/feed" rel="self" type="application/rss+xml" />
	<link>http://www.politigenomics.com/2009/11/bioinformatics-and-cloud-computing.html</link>
	<description>Politics, Information Technology, and Genomics</description>
	<lastBuildDate>Mon, 31 Oct 2011 00:27:27 +0000</lastBuildDate>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.2.1</generator>
	<item>
		<title>By: dd</title>
		<link>http://www.politigenomics.com/2009/11/bioinformatics-and-cloud-computing.html/comment-page-1#comment-15718</link>
		<dc:creator>dd</dc:creator>
		<pubDate>Mon, 11 Jan 2010 00:23:41 +0000</pubDate>
		<guid isPermaLink="false">http://www.politigenomics.com/?p=1728#comment-15718</guid>
		<description>Bob, regarding the costs, I used the same numbers you are using. The analysis in the paper took 320 cores &#215; 3 hr = 960 core&#215;hr and did require the large instance (because of the memory requirements of bowtie). Those 320 cores were spread across 40 instances, so the computational cost was 40 instances &#215; 3 hr &#215; $0.68/instance&#215;hr = $81.60. The additional cost was for data transfer and storage in S3, bringing it up to about $125.

For responses to other parts of your comment (and other people&#039;s comments), see my subsequent post, &lt;a href=&quot;http://www.politigenomics.com/2010/01/head-in-the-clouds.html&quot; rel=&quot;nofollow&quot;&gt;Head in the clouds&lt;/a&gt;.</description>
		<content:encoded><![CDATA[<p>Bob, regarding the costs, I used the same numbers you are using. The analysis in the paper took 320 cores &times; 3 hr = 960 core&times;hr and did require the large instance (because of the memory requirements of bowtie). Those 320 cores were spread across 40 instances, so the computational cost was 40 instances &times; 3 hr &times; $0.68/instance&times;hr = $81.60. The additional cost was for data transfer and storage in S3, bringing it up to about $125.</p>
<p>For responses to other parts of your comment (and other people&#8217;s comments), see my subsequent post, <a href="http://www.politigenomics.com/2010/01/head-in-the-clouds.html" rel="nofollow">Head in the clouds</a>.</p>
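<p>As a quick check of the arithmetic above, here is a minimal sketch in Python. The variable names are illustrative, and the $0.68 rate is treated as a per-instance-hour price, which is an assumption implied by the $81.60 total. Amounts are kept in whole cents to avoid floating-point rounding.</p>

```python
# Figures quoted in the comment above; names are illustrative only.
CORES = 320
INSTANCES = 40
HOURS = 3
RATE_CENTS_PER_INSTANCE_HOUR = 68  # assumed: $0.68 per instance-hour

core_hours = CORES * HOURS                 # total core-hours consumed
cores_per_instance = CORES // INSTANCES    # cores on each instance
compute_cost_cents = INSTANCES * HOURS * RATE_CENTS_PER_INSTANCE_HOUR

print(f"{core_hours} core-hours on {INSTANCES} instances "
      f"({cores_per_instance} cores each)")
print(f"compute cost: ${compute_cost_cents / 100:.2f}")
```

<p>The gap between this compute figure and the roughly $125 total is the S3 data transfer and storage cost, which is not itemised here.</p>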
]]></content:encoded>
	</item>
	<item>
		<title>By: Bob Carpenter</title>
		<link>http://www.politigenomics.com/2009/11/bioinformatics-and-cloud-computing.html/comment-page-1#comment-15604</link>
		<dc:creator>Bob Carpenter</dc:creator>
		<pubDate>Fri, 11 Dec 2009 21:39:43 +0000</pubDate>
		<guid isPermaLink="false">http://www.politigenomics.com/?p=1728#comment-15604</guid>
		<description>I had the opposite reaction -- the cluster pricing from Amazon seems like a bargain.

To repeat what Ben Langmead said above, the total cost of ownership of a computer, even for a university, is much higher than its purchase cost.  For instance, how many computers does each sysadmin manage (or how much time does it take to manage new operating system patches, software installs, etc.)?  How much space do they take up?  The power for these beasts is not inconsiderable.  

When you run machines hammer and tongs with disks flying and memory working full tilt, they tend to wear out pretty quickly. 

My wife&#039;s having trouble with her cluster at NYU because the building&#039;s heating and cooling are both tied to the same faulty plumbing system; so even though it&#039;s winter here in NYC, when the heat went out, so did the machine room cooling, so they had to shut down all the machines for a day or two.  Just like when the AC went out in the summer.

NYU&#039;s machines are also prone to infection by viruses.  They had to completely rebuild their SOLiD cluster for that reason, which also set them back in time and money.  It&#039;s such a huge problem that SOLiD service reps just show up with giveaway 1GB thumb drives they won&#039;t even take back. 

Amazon&#039;s pricing seems to have gone down from what you&#039;re quoting.  Amazon&#039;s EC2 extra-large instance gives you a four-core 15GB machine for US$0.68/hour, or a two-core, 8GB machine for half that.  If you can get away with a 32-bit OS on a single core, it&#039;s only US$0.085/hour.  For $1700, that translates into roughly 100 days (400 core days), 200 days (400 core days), and 800 days, respectively.

Do you really only run an analysis once?  I see people continually rerunning with different settings, different software, against different assemblies, with public (e.g. GEO) data, etc.  All of which requires more overhead at particular times, not necessarily a huge cluster all the time.</description>
		<content:encoded><![CDATA[<p>I had the opposite reaction &#8212; the cluster pricing from Amazon seems like a bargain.</p>
<p>To repeat what Ben Langmead said above, the total cost of ownership of a computer, even for a university, is much higher than its purchase cost.  For instance, how many computers does each sysadmin manage (or how much time does it take to manage new operating system patches, software installs, etc.)?  How much space do they take up?  The power for these beasts is not inconsiderable.  </p>
<p>When you run machines hammer and tongs with disks flying and memory working full tilt, they tend to wear out pretty quickly. </p>
<p>My wife&#8217;s having trouble with her cluster at NYU because the building&#8217;s heating and cooling are both tied to the same faulty plumbing system; so even though it&#8217;s winter here in NYC, when the heat went out, so did the machine room cooling, so they had to shut down all the machines for a day or two.  Just like when the AC went out in the summer.</p>
<p>NYU&#8217;s machines are also prone to infection by viruses.  They had to completely rebuild their SOLiD cluster for that reason, which also set them back in time and money.  It&#8217;s such a huge problem that SOLiD service reps just show up with giveaway 1GB thumb drives they won&#8217;t even take back. </p>
<p>Amazon&#8217;s pricing seems to have gone down from what you&#8217;re quoting.  Amazon&#8217;s EC2 extra-large instance gives you a four-core 15GB machine for US$0.68/hour, or a two-core, 8GB machine for half that.  If you can get away with a 32-bit OS on a single core, it&#8217;s only US$0.085/hour.  For $1700, that translates into roughly 100 days (400 core days), 200 days (400 core days), and 800 days, respectively.</p>
<p>Do you really only run an analysis once?  I see people continually rerunning with different settings, different software, against different assemblies, with public (e.g. GEO) data, etc.  All of which requires more overhead at particular times, not necessarily a huge cluster all the time.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Talking about clouds, TDWG and Eucalyptus &#124; fak3r</title>
		<link>http://www.politigenomics.com/2009/11/bioinformatics-and-cloud-computing.html/comment-page-1#comment-15571</link>
		<dc:creator>Talking about clouds, TDWG and Eucalyptus &#124; fak3r</dc:creator>
		<pubDate>Fri, 04 Dec 2009 18:29:55 +0000</pubDate>
		<guid isPermaLink="false">http://www.politigenomics.com/?p=1728#comment-15571</guid>
		<description>[...] I&#8217;m working with Eucalyptus learning how to set it up, and then configure a slim Linux image that could be scaled out. From there, add the useful applications to it, make it a template others could use on their own Euca setups, or EC2, or both, to do map/reduce, or whatever work they want. This is where my expertise ends, I just want to facilitate the community to be able to get to that point. But, to address that point &#8211; I sent an email out to the group: &#8220;All &#8212; Nick posted this to Twitter, but I wanted to highlight it for everyone here http://www.politigenomics.com/2009/11/bioinformatics-and-cloud-computing.html [...]</description>
		<content:encoded><![CDATA[<p>[...] I&#8217;m working with Eucalyptus learning how to set it up, and then configure a slim Linux image that could be scaled out. From there, add the useful applications to it, make it a template others could use on their own Euca setups, or EC2, or both, to do map/reduce, or whatever work they want. This is where my expertise ends, I just want to facilitate the community to be able to get to that point. But, to address that point &#8211; I sent an email out to the group: &#8220;All &#8212; Nick posted this to Twitter, but I wanted to highlight it for everyone here <a href="http://www.politigenomics.com/2009/11/bioinformatics-and-cloud-computing.html" rel="nofollow">http://www.politigenomics.com/2009/11/bioinformatics-and-cloud-computing.html</a> [...]</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Sucha Sudarsanam</title>
		<link>http://www.politigenomics.com/2009/11/bioinformatics-and-cloud-computing.html/comment-page-1#comment-15567</link>
		<dc:creator>Sucha Sudarsanam</dc:creator>
		<pubDate>Thu, 03 Dec 2009 21:21:31 +0000</pubDate>
		<guid isPermaLink="false">http://www.politigenomics.com/?p=1728#comment-15567</guid>
		<description>One important aspect of cloud computing is not technical at all; it is not about how much CPU, memory, storage, etc. you get. There is an important social aspect. Anyone who wants to try a novel computational idea or start a business based even on an existing method can now do so with minimal friction. If the idea works, great; otherwise you turn off the server in the cloud and walk away.

My hope is that cloud computing will help generate innovative ideas and help to implement novel business plans.</description>
		<content:encoded><![CDATA[<p>One important aspect of cloud computing is not technical at all; it is not about how much CPU, memory, storage, etc. you get. There is an important social aspect. Anyone who wants to try a novel computational idea or start a business based even on an existing method can now do so with minimal friction. If the idea works, great; otherwise you turn off the server in the cloud and walk away.</p>
<p>My hope is that cloud computing will help generate innovative ideas and help to implement novel business plans.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: moondog</title>
		<link>http://www.politigenomics.com/2009/11/bioinformatics-and-cloud-computing.html/comment-page-1#comment-15562</link>
		<dc:creator>moondog</dc:creator>
		<pubDate>Wed, 02 Dec 2009 21:53:14 +0000</pubDate>
		<guid isPermaLink="false">http://www.politigenomics.com/?p=1728#comment-15562</guid>
		<description>Won&#039;t it be great when we can just type this :

cat *.fq &#124; bwa &#124; sam2bam &#124; ./findsnps &#124; ./filter4reality &#124; sort &#124; uniq -c &#124; sort -n &#124; head

then go check your fave blog for a minute or two, come back and see the results.

someday.</description>
		<content:encoded><![CDATA[<p>Won&#8217;t it be great when we can just type this :</p>
<p>cat *.fq | bwa | sam2bam | ./findsnps | ./filter4reality | sort | uniq -c | sort -n | head</p>
<p>then go check your fave blog for a minute or two, come back and see the results.</p>
<p>someday.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Clive G. Brown</title>
		<link>http://www.politigenomics.com/2009/11/bioinformatics-and-cloud-computing.html/comment-page-1#comment-15560</link>
		<dc:creator>Clive G. Brown</dc:creator>
		<pubDate>Wed, 02 Dec 2009 18:51:07 +0000</pubDate>
		<guid isPermaLink="false">http://www.politigenomics.com/?p=1728#comment-15560</guid>
		<description>Hi David,

An excellent analysis - I am also a bit of a cloud sceptic (somebody has to be). You are absolutely right, there is enough time to analyse a GA run on a pretty cheap computer. Even with storage, as long as you don&#039;t keep more than a couple of runs&#039; worth of raw data, you are looking at a relatively cheap system. Amortise that over the lifetime of the instrument (or number of runs) and it comes out pretty competitive at around $150 per run all in (yes, with heat and power etc). (This, and yours, are a class of calculation that goes right back to the early days of Solexa - nothing new).

Even if cloud is a bit cheaper, we&#039;re clearly not talking 10-100X - there&#039;s a lot of wastage elsewhere in most operations that can drown out such small cost savings - like failed runs, libraries and reagent kits, which can easily add up to many thousands (not to mention employee costs, which are always the largest).


(There are also benefits to owning the hardware, like control - and I&#039;m not convinced that all of that software can be run concurrently, with many different resource usage profiles, in a way that maintains the low costs and short execution times, at least not without a lot of re-writing of software. A lot of next-gen seq bioinf apps have heavy and demanding IO requirements; I&#039;m not sure that can be generically abstracted in a way that ensures every user gets what they want all of the time.....)

c</description>
		<content:encoded><![CDATA[<p>Hi David,</p>
<p>An excellent analysis &#8211; I am also a bit of a cloud sceptic (somebody has to be). You are absolutely right, there is enough time to analyse a GA run on a pretty cheap computer. Even with storage, as long as you don&#8217;t keep more than a couple of runs&#8217; worth of raw data, you are looking at a relatively cheap system. Amortise that over the lifetime of the instrument (or number of runs) and it comes out pretty competitive at around $150 per run all in (yes, with heat and power etc). (This, and yours, are a class of calculation that goes right back to the early days of Solexa &#8211; nothing new).</p>
<p>Even if cloud is a bit cheaper, we&#8217;re clearly not talking 10-100X &#8211; there&#8217;s a lot of wastage elsewhere in most operations that can drown out such small cost savings &#8211; like failed runs, libraries and reagent kits, which can easily add up to many thousands (not to mention employee costs, which are always the largest).</p>
<p>(There are also benefits to owning the hardware, like control &#8211; and I&#8217;m not convinced that all of that software can be run concurrently, with many different resource usage profiles, in a way that maintains the low costs and short execution times, at least not without a lot of re-writing of software. A lot of next-gen seq bioinf apps have heavy and demanding IO requirements; I&#8217;m not sure that can be generically abstracted in a way that ensures every user gets what they want all of the time&#8230;..)</p>
<p>c</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: MB</title>
		<link>http://www.politigenomics.com/2009/11/bioinformatics-and-cloud-computing.html/comment-page-1#comment-15528</link>
		<dc:creator>MB</dc:creator>
		<pubDate>Wed, 25 Nov 2009 20:16:00 +0000</pubDate>
		<guid isPermaLink="false">http://www.politigenomics.com/?p=1728#comment-15528</guid>
		<description>If most biological applications could be run in the cloud, cloud computing would be really promising. Unfortunately, there are too few cloud applications. If we want to do something on local computers (e.g. image analysis, base calling and post alignment analysis) and something else in the cloud (e.g. alignment), why not run everything locally, as we do not need many more resources given the current advances in alignment algorithms? Furthermore, cloud computing greatly raises the barrier for software development. Most developers would be reluctant to spend a lot of time learning Hadoop when they get their algorithms working locally, which makes the situation worse.

In my view, cloud computing can only become popular when someone designs a generic modular framework. In the simpler case of Crossbow, I think it would be essential for it to allow other aligners/SNP callers to be plugged in. Crossbow can define the interfaces or the required command-line options, and any aligners/SNP callers that implement this interface can run in a cloud. It would be even better to define a more generic interface for other applications such that a command-line tool can be used in a cloud. This will be harder, though.</description>
		<content:encoded><![CDATA[<p>If most biological applications could be run in the cloud, cloud computing would be really promising. Unfortunately, there are too few cloud applications. If we want to do something on local computers (e.g. image analysis, base calling and post alignment analysis) and something else in the cloud (e.g. alignment), why not run everything locally, as we do not need many more resources given the current advances in alignment algorithms? Furthermore, cloud computing greatly raises the barrier for software development. Most developers would be reluctant to spend a lot of time learning Hadoop when they get their algorithms working locally, which makes the situation worse.</p>
<p>In my view, cloud computing can only become popular when someone designs a generic modular framework. In the simpler case of Crossbow, I think it would be essential for it to allow other aligners/SNP callers to be plugged in. Crossbow can define the interfaces or the required command-line options, and any aligners/SNP callers that implement this interface can run in a cloud. It would be even better to define a more generic interface for other applications such that a command-line tool can be used in a cloud. This will be harder, though.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Ben Langmead</title>
		<link>http://www.politigenomics.com/2009/11/bioinformatics-and-cloud-computing.html/comment-page-1#comment-15521</link>
		<dc:creator>Ben Langmead</dc:creator>
		<pubDate>Wed, 25 Nov 2009 13:37:41 +0000</pubDate>
		<guid isPermaLink="false">http://www.politigenomics.com/?p=1728#comment-15521</guid>
		<description>Hi David,

Analyses like the above are, I think, a great way of advancing the field&#039;s conversation about cloud computing.  I&#039;m really glad you&#039;re taking it up.

My main comment is that you&#039;re comparing the cloud cost against only one type of cost: the one-time cost of buying new machines and adding them to your (already large; at least at Wash U) pool of computers.  That isn&#039;t the only relevant number for a lot of people, especially those in smaller institutions and academic departments, because (a) there are recurring costs for electricity, cooling, space, and (b) there isn&#039;t necessarily a huge pool of computers (and support staff, and space) to begin with, so the initial cost and effort barrier can be much larger than the cost of the machines per se.

To someone facing larger barriers, the fact that computation costs so much less than the sequencing machine (or more importantly, the sequencing consumables) will probably push them toward the cloud rather than away.  Your situation is relatively special; The Genome Center has an existing, large pool of computational power, steady work, and a lot of like-minded sequencing people under one roof.  Academics don&#039;t necessarily have any of those things.

I&#039;ve only heard anecdotal accounts of people calculating recurring-cost comparisons for local vs. cloud, and I&#039;m told that cloud beats local by 2 or 3x.  That&#039;s secondhand, so I hope you&#039;ll try it yourself.

Thanks for the interest - I look forward to the future posts.

Best,
Ben</description>
		<content:encoded><![CDATA[<p>Hi David,</p>
<p>Analyses like the above are, I think, a great way of advancing the field&#8217;s conversation about cloud computing.  I&#8217;m really glad you&#8217;re taking it up.</p>
<p>My main comment is that you&#8217;re comparing the cloud cost against only one type of cost: the one-time cost of buying new machines and adding them to your (already large; at least at Wash U) pool of computers.  That isn&#8217;t the only relevant number for a lot of people, especially those in smaller institutions and academic departments, because (a) there are recurring costs for electricity, cooling, space, and (b) there isn&#8217;t necessarily a huge pool of computers (and support staff, and space) to begin with, so the initial cost and effort barrier can be much larger than the cost of the machines per se.</p>
<p>To someone facing larger barriers, the fact that computation costs so much less than the sequencing machine (or more importantly, the sequencing consumables) will probably push them toward the cloud rather than away.  Your situation is relatively special; The Genome Center has an existing, large pool of computational power, steady work, and a lot of like-minded sequencing people under one roof.  Academics don&#8217;t necessarily have any of those things.</p>
<p>I&#8217;ve only heard anecdotal accounts of people calculating recurring-cost comparisons for local vs. cloud, and I&#8217;m told that cloud beats local by 2 or 3x.  That&#8217;s secondhand, so I hope you&#8217;ll try it yourself.</p>
<p>Thanks for the interest &#8211; I look forward to the future posts.</p>
<p>Best,<br />
Ben</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Gary Stiehr</title>
		<link>http://www.politigenomics.com/2009/11/bioinformatics-and-cloud-computing.html/comment-page-1#comment-15511</link>
		<dc:creator>Gary Stiehr</dc:creator>
		<pubDate>Tue, 24 Nov 2009 21:15:28 +0000</pubDate>
		<guid isPermaLink="false">http://www.politigenomics.com/?p=1728#comment-15511</guid>
		<description>David, you beat me to the financial analysis post!  Nice analysis.  One thing that might change the calculation is taking into account the cost on fewer EC2 cores.  As you point out, we may not need to finish in 2.8 hours.  

As you may have seen in the last paragraph of my post (http://hpcinfo.com/2009/11/22/benchmarking-the-cloud-for-genomics/), a significant premium is paid to get this done in 2.8 hours using 320 cores instead of 6.5 hours using 80 cores.  Due to the non-linear scaling, the cost goes from around $8 per hour to around $29 per hour of elapsed time (this is using the EC2 costs only, whereas you are counting the storage costs as well in the $125).  

Also, lacking an analysis of the CPU usage efficiency on the EC2 nodes, one cannot necessarily say that we&#039;d need the same quantity of cores to complete the analysis in the same time frame.</description>
		<content:encoded><![CDATA[<p>David, you beat me to the financial analysis post!  Nice analysis.  One thing that might change the calculation is taking into account the cost on fewer EC2 cores.  As you point out, we may not need to finish in 2.8 hours.  </p>
<p>As you may have seen in the last paragraph of my post (<a href="http://hpcinfo.com/2009/11/22/benchmarking-the-cloud-for-genomics/" rel="nofollow">http://hpcinfo.com/2009/11/22/benchmarking-the-cloud-for-genomics/</a>), a significant premium is paid to get this done in 2.8 hours using 320 cores instead of 6.5 hours using 80 cores.  Due to the non-linear scaling, the cost goes from around $8 per hour to around $29 per hour of elapsed time (this is using the EC2 costs only, whereas you are counting the storage costs as well in the $125).  </p>
<p>Also, lacking an analysis of the CPU usage efficiency on the EC2 nodes, one cannot necessarily say that we&#8217;d need the same quantity of cores to complete the analysis in the same time frame.</p>
]]></content:encoded>
	</item>
</channel>
</rss>

