<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>PolITiGenomics &#187; storage</title>
	<atom:link href="http://www.politigenomics.com/tag/storage/feed" rel="self" type="application/rss+xml" />
	<link>http://www.politigenomics.com</link>
	<description>Politics, Information Technology, and Genomics</description>
	<lastBuildDate>Thu, 21 Apr 2011 17:49:06 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.2.1</generator>
		<item>
		<title>Next-Generation Sequencing Informatics Update</title>
		<link>http://www.politigenomics.com/2010/02/next-generation-sequencing-informatics-update.html</link>
		<comments>http://www.politigenomics.com/2010/02/next-generation-sequencing-informatics-update.html#comments</comments>
		<pubDate>Fri, 19 Feb 2010 21:55:23 +0000</pubDate>
		<dc:creator>dd</dc:creator>
				<category><![CDATA[genomics]]></category>
		<category><![CDATA[IT]]></category>
		<category><![CDATA[compute]]></category>
		<category><![CDATA[Illumina]]></category>
		<category><![CDATA[informatics]]></category>
		<category><![CDATA[storage]]></category>

		<guid isPermaLink="false">http://www.politigenomics.com/?p=2143</guid>
		<description><![CDATA[I updated the Next-Generation Sequencing Informatics table a few weeks ago but forgot to mention it on the blog. The main update was the 50G configuration of the Illumina GA IIx. Also, the Sides &#038; Associates blog linked to my table and referred to it as a &#8220;somewhat dated comparison of next-generation sequencing platforms.&#8221; Just [...]]]></description>
			<content:encoded><![CDATA[<p>I updated the <a href="http://www.politigenomics.com/next-generation-sequencing-informatics">Next-Generation Sequencing Informatics table</a> a few weeks ago but forgot to mention it on the blog. The main update was the 50G configuration of the <a href="http://www.illumina.com/systems/genome_analyzer_iix.ilmn">Illumina GA IIx</a>. Also, the Sides &#038; Associates blog linked to my table and referred to it as a &#8220;<a href="http://sidesandassociates.com/blog/2010/01/01/the-business-of-sequencing/">somewhat dated comparison of next-generation sequencing platforms</a>.&#8221; Just to clarify, this table represents <em>average</em> throughput for <em>production</em> systems; not vendor claims about throughput, not future vaporware (and Alejandro Gutierrez corrected his description in the post once I pointed this out). As new systems come online and further improvements are made to existing platforms, the table will be updated.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.politigenomics.com/2010/02/next-generation-sequencing-informatics-update.html/feed</wfw:commentRss>
		<slash:comments>6</slash:comments>
		</item>
		<item>
		<title>What&#8217;s in an Illumina GA run directory?</title>
		<link>http://www.politigenomics.com/2009/10/whats-in-an-illumina-ga-run-directory.html</link>
		<comments>http://www.politigenomics.com/2009/10/whats-in-an-illumina-ga-run-directory.html#comments</comments>
		<pubDate>Wed, 28 Oct 2009 21:46:40 +0000</pubDate>
		<dc:creator>dd</dc:creator>
				<category><![CDATA[genomics]]></category>
		<category><![CDATA[IT]]></category>
		<category><![CDATA[Illumina]]></category>
		<category><![CDATA[informatics]]></category>
		<category><![CDATA[storage]]></category>

		<guid isPermaLink="false">http://www.politigenomics.com/?p=1660</guid>
		<description><![CDATA[One of the main things that differentiates genomics from other endeavors that use a lot of disk space is that genomics file systems tend to have a lot of files (millions). This was true with Sanger sequencing, and it seems to be even more true with next-generation sequencing technologies, especially Illumina/Solexa and AB SOLiD. This [...]]]></description>
			<content:encoded><![CDATA[<p>One of the main things that differentiates genomics from other endeavors that use a lot of disk space is that genomics file systems tend to have a <em>lot</em> of files (millions). This was true with Sanger sequencing, and it seems to be even more true with next-generation sequencing technologies, especially Illumina/Solexa and AB SOLiD. This large number of files and the parallel access of these files by large computational clusters tends to give most storage solutions great difficulty.</p>
<p>So what, exactly, is in an Illumina run directory? Well, to get breakdowns of file statistics there is a nifty little tool called <a href="http://www.pdsi-scidac.org/fsstats/">fsstats</a>. It is just a simple Perl script that crawls through a directory stat&#8217;ing files and reporting metrics. For example, when you run it on an Illumina GA IIx 2&times;100, high cluster density run after the primary analysis has completed, you get the following information about the distribution of file sizes. (I have rearranged and condensed the information to make it fit.)</p>
<pre style="font-size: x-small; line-height: normal;">
total 7.46 TB used to store 7.46 TB user data, overhead 0.04%
  count=991227 avg=8076.50 KB
  min=0.00 KB max=13128679.30 KB
           size range    count   %tot  %tot cum       total size   %tot  %tot cum
[       0-       2 KB):   4019 ( 0.41) (  0.41)       3009.03 KB ( 0.00) (  0.00)
[       2-       4 KB):      2 ( 0.00) (  0.41)          6.99 KB ( 0.00) (  0.00)
[       4-       8 KB):    981 ( 0.10) (  0.50)       5964.82 KB ( 0.00) (  0.00)
[       8-      16 KB): 193351 (19.51) ( 20.01)    2588619.88 KB ( 0.03) (  0.03)
[      16-      32 KB):   2656 ( 0.27) ( 20.28)      58586.79 KB ( 0.00) (  0.03)
[      32-      64 KB):    901 ( 0.09) ( 20.37)      31369.79 KB ( 0.00) (  0.03)
[      64-     128 KB):   2893 ( 0.29) ( 20.66)     303872.38 KB ( 0.00) (  0.04)
[     128-     256 KB):      2 ( 0.00) ( 20.66)        345.34 KB ( 0.00) (  0.04)
[     256-     512 KB):      4 ( 0.00) ( 20.66)       1222.53 KB ( 0.00) (  0.04)
[     512-    1024 KB):      1 ( 0.00) ( 20.66)        622.26 KB ( 0.00) (  0.04)
[    1024-    2048 KB):      2 ( 0.00) ( 20.66)       3199.89 KB ( 0.00) (  0.04)
[    2048-    4096 KB):     12 ( 0.00) ( 20.66)      41779.69 KB ( 0.00) (  0.04)
[    4096-    8192 KB): 776654 (78.35) ( 99.02) 5863161178.18 KB (73.24) ( 73.28)
[   16384-   32768 KB):     21 ( 0.00) ( 99.02)     487156.46 KB ( 0.01) ( 73.28)
[   32768-   65536 KB):   3856 ( 0.39) ( 99.41)  163552521.17 KB ( 2.04) ( 75.32)
[   65536-  131072 KB):   3825 ( 0.39) ( 99.79)  307535341.32 KB ( 3.84) ( 79.17)
[  131072-  262144 KB):    133 ( 0.01) ( 99.81)   32458046.12 KB ( 0.41) ( 79.57)
[  262144-  524288 KB):   1787 ( 0.18) ( 99.99)  658830514.46 KB ( 8.23) ( 87.80)
[ 2097152- 4194304 KB):     16 ( 0.00) ( 99.99)   47898262.36 KB ( 0.60) ( 88.40)
[ 4194304- 8388608 KB):     64 ( 0.01) (100.00)  432084134.39 KB ( 5.40) ( 93.80)
[ 8388608-16777216 KB):     47 ( 0.00) (100.00)  496603147.67 KB ( 6.20) (100.00)
</pre>
<p>So the total size of the run directory is nearly 7.5 TB and there are almost one million files. The average size of a file in the run directory is about 8 MB and the maximum size is over 13 GB. The images (represented in the 4096-8192 KB range) comprise over 78% of the files and 73% of the total size of the run directory. This significant storage penalty can be avoided by using RTA and not transferring image files. The largest files are the alignment (ELAND) outputs and the FASTQ files in the GERALD directory. Speaking of directories, here is a breakdown by number of files in each directory.</p>
<pre style="font-size: x-small; line-height: normal;">
  count=1652 avg=601.02 ents
  min=0.00 ents max=24720.00 ents
              range   count   %tot  %tot cum total ent   %tot  %tot cum
  [    0-    1 ents]:     4 ( 0.24) (  0.24)      0.00 ( 0.00) (  0.00)
  [    2-    3 ents]:     1 ( 0.06) (  0.30)      2.00 ( 0.00) (  0.00)
  [    8-   15 ents]:     3 ( 0.18) (  0.48)     26.00 ( 0.00) (  0.00)
  [   16-   31 ents]:     2 ( 0.12) (  0.61)     44.00 ( 0.00) (  0.01)
  [  128-  255 ents]:     9 ( 0.54) (  1.15)   1826.00 ( 0.18) (  0.19)
  [  256-  511 ents]:  1616 (97.82) ( 98.97) 775680.00 (78.12) ( 78.32)
  [  512- 1023 ents]:     3 ( 0.18) ( 99.15)   2920.00 ( 0.29) ( 78.61)
  [ 1024- 2047 ents]:     4 ( 0.24) ( 99.39)   7845.00 ( 0.79) ( 79.40)
  [ 2048- 4095 ents]:     2 ( 0.12) ( 99.52)   6775.00 ( 0.68) ( 80.08)
  [16384-32767 ents]:     8 ( 0.48) (100.00) 197760.00 (19.92) (100.00)
</pre>
<p>The picture for directory entries is a bit muddled since most of the directories are organized around a small multiple of the number of tiles per lane, falling in the 256-511 entries range. The directories in the 16384-32767 entries range? Those are the image analysis (Firecrest) Temp/L00[1-8] directories, each with 24,720 entries: four <code>clu.txt</code> files per tile (one per color) and one <code>qcm.xml</code> (XML!) file for each cycle for each tile in a lane.</p>
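<p>If you do not have fsstats handy, the file size breakdown above is easy to approximate. The following is only a minimal Python sketch of the idea, not fsstats itself: it assumes power-of-two size buckets measured in KB, as in the output above, and the function names are made up for this example.</p>
<pre style="font-size: x-small; line-height: normal;">
import os
from collections import Counter


def size_bucket_kb(size_bytes):
    """Return the lower edge, in KB, of the power-of-two bucket for a file size."""
    kb = size_bytes // 1024
    if kb in (0, 1):
        return 0  # smallest bucket is [0-2 KB), as in the fsstats output
    return 2 ** (kb.bit_length() - 1)


def histogram(root):
    """Walk root and count the files found in each size bucket."""
    counts = Counter()
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            try:
                size = os.lstat(os.path.join(dirpath, name)).st_size
            except OSError:
                continue  # file vanished or is unreadable; skip it
            counts[size_bucket_kb(size)] += 1
    return counts


def report(counts):
    """Format one line per non-empty bucket, smallest bucket first."""
    lines = []
    for low, n in sorted(counts.items()):
        high = 2 if low == 0 else 2 * low
        lines.append(f"[{low:8d}-{high:8d} KB): {n:7d}")
    return lines
</pre>
<p>Printing the lines returned by <code>report(histogram(path))</code> for a run directory gives the count column only. Percentages and total sizes per bucket are left out to keep the sketch short.</p>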
]]></content:encoded>
			<wfw:commentRss>http://www.politigenomics.com/2009/10/whats-in-an-illumina-ga-run-directory.html/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Next-Generation Sequencing Informatics table update</title>
		<link>http://www.politigenomics.com/2009/10/next-generation-sequencing-informatics-table-update.html</link>
		<comments>http://www.politigenomics.com/2009/10/next-generation-sequencing-informatics-table-update.html#comments</comments>
		<pubDate>Mon, 05 Oct 2009 14:45:02 +0000</pubDate>
		<dc:creator>dd</dc:creator>
				<category><![CDATA[genomics]]></category>
		<category><![CDATA[IT]]></category>
		<category><![CDATA[454]]></category>
		<category><![CDATA[compute]]></category>
		<category><![CDATA[Illumina]]></category>
		<category><![CDATA[informatics]]></category>
		<category><![CDATA[SOLiD]]></category>
		<category><![CDATA[storage]]></category>

		<guid isPermaLink="false">http://www.politigenomics.com/?p=1606</guid>
		<description><![CDATA[I have made some updates to the Next-Generation Sequencing Informatics table. Specifically, I have updated the numbers for 454 Ti, including paired-end information, and added information on the Illumina GA IIx. If anyone who is not employed by AB has real-world numbers for SOLiD 3, I&#8217;d appreciate you passing them along to me (I&#8217;m looking [...]]]></description>
			<content:encoded><![CDATA[<p>I have made some updates to the <a href="http://www.politigenomics.com/next-generation-sequencing-informatics">Next-Generation Sequencing Informatics table</a>. Specifically, I have updated the numbers for 454 Ti, including paired-end information, and added information on the Illumina GA IIx. If anyone who is not employed by AB has real-world numbers for SOLiD 3, I&#8217;d appreciate you passing them along to me (I&#8217;m looking at you, drd).</p>
<p><strong>Update:</strong> I received some SOLiD 3 numbers from Nicholas Socci (thanks Nicholas!).</p>
<p><strong>Update2:</strong> I received a fuller set of numbers from drd and the SOLiD 3 column is complete (thanks drd!).</p>
]]></content:encoded>
			<wfw:commentRss>http://www.politigenomics.com/2009/10/next-generation-sequencing-informatics-table-update.html/feed</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>Expansion</title>
		<link>http://www.politigenomics.com/2009/10/expansion.html</link>
		<comments>http://www.politigenomics.com/2009/10/expansion.html#comments</comments>
		<pubDate>Fri, 02 Oct 2009 19:04:04 +0000</pubDate>
		<dc:creator>dd</dc:creator>
				<category><![CDATA[IT]]></category>
		<category><![CDATA[compute]]></category>
		<category><![CDATA[data center]]></category>
		<category><![CDATA[storage]]></category>
		<category><![CDATA[wustl]]></category>

		<guid isPermaLink="false">http://www.politigenomics.com/?p=114</guid>
		<description><![CDATA[The Genome Data Center has received a Gold LEED Certification from the U.S. Green Building Council. This is in addition to the Keystone Award from the St. Louis Association of General Contractors. It is quite an achievement for a power-hungry data center to receive a LEED certification, much less a Gold Certification, but the [...]]]></description>
			<content:encoded><![CDATA[<p><a href="http://www.usgbc.org/DisplayPage.aspx?CMSPageID=1991"><img src="http://www.politigenomics.com/wp-content/uploads/2009/10/leed.png" alt="LEED Certification" title="LEED Certification" width="275" height="393" class="alignright size-full wp-image-1581" /></a></p>
<p>The Genome Data Center has received a <a href="http://www.usgbc.org/DisplayPage.aspx?CategoryID=19">Gold LEED Certification</a> from the <a href="http://www.usgbc.org/">U.S. Green Building Council</a>. This is in addition to the <a href="http://www.politigenomics.com/2008/11/keystone-award-for-data-center.html">Keystone Award</a> from the St. Louis Association of General Contractors. It is quite an achievement for a power-hungry data center to receive a LEED certification, much less a Gold Certification, but the WUSM Design and Construction team along with the architects, engineers, and contractors were able to pull it off.</p>
<p>Recently the final phase of construction at the Genome Data Center was completed. The initial build out had enough power and cooling for about 40 racks of equipment. Now at full capacity, the data center is capable of supplying 4 MW of power (about the amount used by 800 homes on a hot day) and the requisite cooling to the equipment housed within it. This will support over 100 racks worth of high-density computational (blades) and storage equipment and its supporting infrastructure (chilled water plants, air handlers, humidity control, office space, etc.). The electrical system is completely redundant, all the way to the double-ended substation of our electrical utility. That means even if we lose one entire electrical feed, we can still operate on utility power. If we lose both electrical feeds, we have battery and fly-wheel UPS systems to carry us until the two 2 MW diesel generators start (under 10 seconds). <a href="http://www.flickr.com/photos/ddgenome/3730196210/" title="2 MW diesel generator"><img src="http://farm4.static.flickr.com/3428/3730196210_cdb8bfeb18.jpg" width="452" height="339" alt="generator" /></a> The building is about 1480 m<sup>2</sup> while the actual data center is about 288 m<sup>2</sup> (as they shrink computing equipment, the required electrical and cooling equipment keeps increasing in size). The data center is arranged in a standard hot aisle/cold aisle layout with cooling delivered from below through floor grates (perf plates did not provide enough airflow) via a 1.2 m raised floor. <a href="http://www.flickr.com/photos/ddgenome/3730194974/" title="data center cold aisle"><img src="http://farm3.static.flickr.com/2565/3730194974_8fddfc00b1.jpg" width="452" height="339" alt="cold aisle" /></a> We currently have about 3,000 cores in our computational cluster and over 3 PB (3,000,000 GB) of storage online. 
When full of equipment in a few years, the data center will likely house tens of thousands of cores and on the order of 100 PB of storage.</p>
<p>There are more pictures of the Genome Data Center on <a href="http://www.flickr.com/photos/22486047@N03/sets/72157603633991423/">Flickr</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.politigenomics.com/2009/10/expansion.html/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Data intensive science</title>
		<link>http://www.politigenomics.com/2009/03/data-intensive-science.html</link>
		<comments>http://www.politigenomics.com/2009/03/data-intensive-science.html#comments</comments>
		<pubDate>Fri, 06 Mar 2009 23:00:12 +0000</pubDate>
		<dc:creator>dd</dc:creator>
				<category><![CDATA[IT]]></category>
		<category><![CDATA[compute]]></category>
		<category><![CDATA[storage]]></category>

		<guid isPermaLink="false">http://www.politigenomics.com/?p=972</guid>
		<description><![CDATA[There is an interesting article, Beyond the Data Deluge, in Science that discusses how various scientific disciplines are facing massive increases in data generation rates and how they are dealing with it. The article has several useful references for those interested in learning more. It seems the authors worked closely with Jim Gray, who went [...]]]></description>
			<content:encoded><![CDATA[<p>There is an interesting article, <a href="http://www.sciencemag.org/cgi/content/full/323/5919/1297">Beyond the Data Deluge</a>, in Science that discusses how various scientific disciplines are facing massive increases in data generation rates and how they are dealing with it. The article has several useful references for those interested in learning more. It seems the authors worked closely with <a href="http://research.microsoft.com/en-us/um/people/gray/">Jim Gray</a>, who <a href="http://research.microsoft.com/news/featurestories/publish/Gray.aspx">went missing</a> while sailing about two years ago.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.politigenomics.com/2009/03/data-intensive-science.html/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Ensembl on Amazon</title>
		<link>http://www.politigenomics.com/2008/12/ensembl-on-amazon.html</link>
		<comments>http://www.politigenomics.com/2008/12/ensembl-on-amazon.html#comments</comments>
		<pubDate>Fri, 05 Dec 2008 15:46:08 +0000</pubDate>
		<dc:creator>dd</dc:creator>
				<category><![CDATA[genomics]]></category>
		<category><![CDATA[IT]]></category>
		<category><![CDATA[compute]]></category>
		<category><![CDATA[informatics]]></category>
		<category><![CDATA[storage]]></category>

		<guid isPermaLink="false">http://www.politigenomics.com/?p=668</guid>
		<description><![CDATA[The Amazon Web Services (AWS) blog has an entry on using Amazon&#8217;s Elastic Compute Cloud (EC2) to host and access public data sets, including Ensembl release 51. The data are stored as Amazon Elastic Block Store (Amazon EBS) snapshots. Anyone using EC2 can then create their own EBS using the public data EBS as a [...]]]></description>
			<content:encoded><![CDATA[<p>The Amazon Web Services (AWS) blog has an entry on using <a href="http://aws.typepad.com/aws/2008/12/paging-researchers-analysts-and-developers.html">Amazon&#8217;s Elastic Compute Cloud (EC2) to host and access public data sets</a>, including <a href="http://www.ensembl.org/">Ensembl</a> release 51. The data are stored as Amazon Elastic Block Store (Amazon EBS) snapshots. Anyone using EC2 can then create their own EBS using the public data EBS as a starting point. The data are then available to the user to modify, update, and perform calculations using the cloud. You can find more information on how to use the available public data sets and even upload your own data sets at <a href="http://aws.amazon.com/publicdatasets/">Public Data Sets on AWS</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.politigenomics.com/2008/12/ensembl-on-amazon.html/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Next-Generation Sequencing Informatics</title>
		<link>http://www.politigenomics.com/2008/12/next-generation-sequencing-informatics.html</link>
		<comments>http://www.politigenomics.com/2008/12/next-generation-sequencing-informatics.html#comments</comments>
		<pubDate>Thu, 04 Dec 2008 22:15:59 +0000</pubDate>
		<dc:creator>dd</dc:creator>
				<category><![CDATA[genomics]]></category>
		<category><![CDATA[IT]]></category>
		<category><![CDATA[454]]></category>
		<category><![CDATA[compute]]></category>
		<category><![CDATA[Illumina]]></category>
		<category><![CDATA[informatics]]></category>
		<category><![CDATA[SOLiD]]></category>
		<category><![CDATA[storage]]></category>

		<guid isPermaLink="false">http://www.politigenomics.com/?p=660</guid>
		<description><![CDATA[I have put together a table with a bunch of important metrics for the major next-generation sequencing platforms: Next-Generation Sequencing Informatics (there is also a link on the left-hand side of the page). It includes number of reads, read length, data sizes, computational time, etc. I will try to keep it as up to date [...]]]></description>
			<content:encoded><![CDATA[<p>I have put together a table with a bunch of important metrics for the major next-generation sequencing platforms: <a href="http://www.politigenomics.com/next-generation-sequencing-informatics">Next-Generation Sequencing Informatics</a> (there is also a link on the left-hand side of the page). It includes number of reads, read length, data sizes, computational time, etc. I will try to keep it as up to date as I can and add new platforms and revisions as they become available. Consider it an early Christmas present.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.politigenomics.com/2008/12/next-generation-sequencing-informatics.html/feed</wfw:commentRss>
		<slash:comments>4</slash:comments>
		</item>
		<item>
		<title>Living data storage</title>
		<link>http://www.politigenomics.com/2008/07/living-data-storage.html</link>
		<comments>http://www.politigenomics.com/2008/07/living-data-storage.html#comments</comments>
		<pubDate>Tue, 08 Jul 2008 13:22:23 +0000</pubDate>
		<dc:creator>dd</dc:creator>
				<category><![CDATA[genomics]]></category>
		<category><![CDATA[IT]]></category>
		<category><![CDATA[science]]></category>
		<category><![CDATA[storage]]></category>

		<guid isPermaLink="false">http://www.politigenomics.com/?p=102</guid>
		<description><![CDATA[Researchers in Japan have created the first DNA molecule made from entirely artificially created bases. While some are trying to manufacture artificial life by creating DNA molecules in the lab, the aim of this research is to take advantage of, and in this case expand on, DNA&#8217;s high information density to create a very dense [...]]]></description>
			<content:encoded><![CDATA[<p>Researchers in Japan have created the <a href="http://www.sciencedaily.com/releases/2008/07/080707091915.htm">first DNA molecule made from entirely artificially created bases</a>. While some are trying to manufacture artificial life by creating DNA molecules in the lab, the aim of this research is to take advantage of, and in this case expand on, DNA&#8217;s high information density to create a very dense data storage platform. Natural DNA has four different bases (A, G, C, and T) it can use to encode information. To expand DNA&#8217;s ability to encode information, these researchers created four new bases and integrated them into a DNA molecule.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.politigenomics.com/2008/07/living-data-storage.html/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>N Genomes</title>
		<link>http://www.politigenomics.com/2008/05/n-genomes.html</link>
		<comments>http://www.politigenomics.com/2008/05/n-genomes.html#comments</comments>
		<pubDate>Fri, 09 May 2008 20:21:23 +0000</pubDate>
		<dc:creator>dd</dc:creator>
				<category><![CDATA[genomics]]></category>
		<category><![CDATA[1000 Genomes]]></category>
		<category><![CDATA[454]]></category>
		<category><![CDATA[CSHL]]></category>
		<category><![CDATA[health]]></category>
		<category><![CDATA[Illumina]]></category>
		<category><![CDATA[informatics]]></category>
		<category><![CDATA[IT]]></category>
		<category><![CDATA[science]]></category>
		<category><![CDATA[SOLiD]]></category>
		<category><![CDATA[storage]]></category>

		<guid isPermaLink="false">http://www.politigenomics.com/?p=74</guid>
		<description><![CDATA[Earlier this week there were several meetings about the 1000 Genomes Project at Cold Spring Harbor Labs. The first meeting Monday morning was about data flow and data repositories. NCBI&#8217;s Short Read Archive (SRA) and the equivalent at EBI (which should be ready in a month or two) will house all the data. The pilot [...]]]></description>
			<content:encoded><![CDATA[<p>Earlier this week there were several meetings about the <a href="http://www.1000genomes.org/">1000 Genomes Project</a> at <a href="http://www.cshl.edu/">Cold Spring Harbor Labs</a>.  The first meeting Monday morning was about data flow and data repositories.  NCBI&#8217;s <a href="http://www.ncbi.nlm.nih.gov/Traces/sra/sra.cgi">Short Read Archive (SRA)</a> and the equivalent at <a href="http://www.ebi.ac.uk/">EBI</a> (which should be ready in a month or two) will house all the data.  The <a href="http://www.politigenomics.com/2008/03/1000-genomes.html">pilot projects</a> for the 1000 Genomes Project started less than two months ago and have already generated as much sequence data as half of the entire <a href="http://www.ncbi.nlm.nih.gov/Traces/trace.cgi">trace archive</a> (which contains the sequence data for all publicly funded genome projects over the last 10 years).  In other words, this project is going to generate a <span style="font-weight:bold;">lot</span> of sequence data (not to mention all the data generated by analysis of the sequence).  Paul Flicek from EBI estimates the pilot projects alone will generate about 1 PB (1,000,000 GB) of sequence data.  Moving that much data from site to site will be a challenge.  Normal solutions, e.g., FTP, rsync, and shipping hard drives, can&#8217;t seem to keep up with the data generation rates.  NCBI, EBI, and the sequencing centers are testing a high-speed data transfer solution called <a href="http://www.asperasoft.com/products/scp/index.html">Aspera scp</a>.  It has impressive transfer rates, but seems to stall after a while for no discernible reason.  We&#8217;ll see if we can get it to work reliably over the coming weeks.</p>
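<p>To get a feel for why moving a petabyte (1,000,000 GB) is a challenge, here is a back-of-the-envelope Python sketch. The link speeds are illustrative assumptions of mine, not numbers from the meeting, and the calculation assumes the link is fully and continuously used, which real transfers rarely manage.</p>
<pre style="font-size: x-small; line-height: normal;">
PETABYTE_BYTES = 10 ** 15  # decimal units: 1 PB = 1,000,000 GB


def transfer_days(num_bytes, gigabits_per_second):
    """Days needed to move num_bytes over a link sustained at the given rate."""
    seconds = num_bytes * 8 / (gigabits_per_second * 10 ** 9)
    return seconds / 86400


for rate in (0.1, 1, 10):  # assumed sustained rates in Gbit/s
    print(f"{rate:5} Gbit/s: {transfer_days(PETABYTE_BYTES, rate):7.1f} days")
</pre>
<p>Even at a sustained 1 Gbit/s, one petabyte takes roughly 93 days to transfer.</p>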
<p>After the data flow meeting was a meeting of the 1000 Genomes Steering Committee.  The day and a half that ensued was filled with a lot of lively discussion.  When all was said and done, one thing was clear: there are a lot of questions that need to be answered.  The analysis group presented convincing results from simulations that indicated 2× coverage in a large number of individual genomes (Pilot 1) is probably not sufficient to detect the rare variants the project is going after (present in 1-2% of the population).  The simulations indicated that the power of the study to detect such variants (at a constant cost, i.e., constant total amount of sequence generated) would be greatly enhanced by sequencing half as many people at 4× coverage.  There was no firm decision on how to change the pilot (if at all), but going forward it is likely that some of the individuals in Pilot 1 will be sequenced up to 4× or even 8×.  Thus, while the project may be named 1000 Genomes, exactly how many genomes we are going to sequence is yet to be determined.</p>
<p>Another issue that arose was the rapid development of the massively parallel sequencing technologies.  These platforms (454 FLX, Illumina Genome Analyzer, and AB SOLiD) increase their throughput, improve their data quality, improve analysis software, etc. several times each year.  Such dynamic platforms make the development of tools to analyze their data, e.g., align the data to a reference genome and detect variants, very difficult.  The right platforms and tools today may not be the best next month or next year when the main project gets underway.  This causes two major needs to come to the fore.  First, experimental design will not end when the project starts.  The experiment will need to be adjusted as capabilities and capacities change.  Second, we will not only have to continually develop and refine tools throughout the project, we will need to develop frameworks to continually test and compare the tools that are available.  It&#8217;s always fun to hit a moving target.</p>
<p>The meeting also discussed the ethical, legal, and social implications (ELSI) of the project.  This discussion largely focused on which populations to sample for the project.  Should we deepen our knowledge of individuals of Central European, African, and East Asian ancestry to aid in methodology development?  Or should we broaden our knowledge of overall human variation by including fewer individuals from a larger number of populations?  To be determined&hellip;</p>
]]></content:encoded>
			<wfw:commentRss>http://www.politigenomics.com/2008/05/n-genomes.html/feed</wfw:commentRss>
		<slash:comments>6</slash:comments>
		</item>
		<item>
		<title>Lustre bluster</title>
		<link>http://www.politigenomics.com/2008/02/lustre-bluster.html</link>
		<comments>http://www.politigenomics.com/2008/02/lustre-bluster.html#comments</comments>
		<pubDate>Mon, 18 Feb 2008 19:18:00 +0000</pubDate>
		<dc:creator>admin</dc:creator>
				<category><![CDATA[IT]]></category>
		<category><![CDATA[Lustre]]></category>
		<category><![CDATA[storage]]></category>

		<guid isPermaLink="false">http://localhost/wordpress/?p=23</guid>
		<description><![CDATA[What is Lustre? It is a cluster file system developed by Cluster File Systems, Inc., which was recently purchased by Sun. We do not have much experience with Lustre but we have played around a bit with it in house and I have talked a lot with other centers who do use it (like Sanger). [...]]]></description>
			<content:encoded><![CDATA[<p>What is <a href="http://wiki.lustre.org/index.php?title=Main_Page">Lustre</a>? It is a cluster file system developed by Cluster File Systems, Inc., which was <a href="http://www.sun.com/software/clusterfs/index.xml">recently purchased</a> by <a href="http://www.sun.com/">Sun</a>.  We do not have much experience with Lustre but we have played around a bit with it in house and I have talked a lot with other centers who do use it (like <a href="http://www.sanger.ac.uk/">Sanger</a>).  The main advantages of Lustre are that it is very scalable and can sustain very high performance.  The two main problems with Lustre are that it is extremely difficult to get working and it is very difficult to back up your data on a Lustre file system.  The difficulty in getting it to work seems to be a business strategy of Cluster File Systems.  It is hard to get paid for installation and configuration consulting if you have a well-documented, easy to perform process.  It will be interesting to see if this will change given <a href="http://blogs.sun.com/jonathan/">Sun CEO Jonathan Schwartz</a>&#8217;s well-publicized commitment to open source software.  The second difficulty, backup, may be a bit of a red herring.  Do you really want to put archive-worthy data on a high-performance, clustered, 20 TB file system that depends on a metadata server for file system integrity?  Lustre file systems should really be used for high-performance scratch space.  Once the calculations are done, move the data somewhere else.  Of course, if you have a lot of data, this may take a while.  So you may want to move it two places: on-line and off-line storage.  Because if you have a lot of data to move, it will also be a lot of data to back up.  So you may just want to dump it to tape one time while you are dumping it to your on-line storage.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.politigenomics.com/2008/02/lustre-bluster.html/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>

