<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>PolITiGenomics &#187; SOLiD</title>
	<atom:link href="http://www.politigenomics.com/tag/solid/feed" rel="self" type="application/rss+xml" />
	<link>http://www.politigenomics.com</link>
	<description>Politics, Information Technology, and Genomics</description>
	<lastBuildDate>Thu, 21 Apr 2011 17:49:06 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.2.1</generator>
		<item>
		<title>Life finds a way</title>
		<link>http://www.politigenomics.com/2010/01/life-finds-a-way.html</link>
		<comments>http://www.politigenomics.com/2010/01/life-finds-a-way.html#comments</comments>
		<pubDate>Fri, 29 Jan 2010 22:43:25 +0000</pubDate>
		<dc:creator>dd</dc:creator>
				<category><![CDATA[genomics]]></category>
		<category><![CDATA[Illumina]]></category>
		<category><![CDATA[SOLiD]]></category>

		<guid isPermaLink="false">http://www.politigenomics.com/?p=2060</guid>
		<description><![CDATA[Earlier this week Life Technologies announced the next revision of their SOLiD platform, SOLiD 4. I don&#8217;t have all the details that I had for the Illumina HiSeq 2000, but here is what I do know: the system will produce 100 Gb of alignable sequence data on two slides per 14-day run. The sequence [...]]]></description>
			<content:encoded><![CDATA[<p><a href="http://www.appliedbiosystems.com/solid4"><img alt="SOLiD 4" src="http://www3.appliedbiosystems.com/cms/groups/portal/documents/web_content/cms_076478.jpg" title="SOLiD 4" class="alignright" width="200" height="205" /></a></p>
<p>Earlier this week <a href="http://www.lifetechnologies.com/">Life Technologies</a> announced the next revision of their SOLiD platform, <a href="http://www.lifetechnologies.com/life-technologies-brings-genomic-sequencing-closer-clinic.html">SOLiD 4</a>. I don&#8217;t have all the details that I had for the <a href="http://www.politigenomics.com/2010/01/hiseq-2000.html">Illumina HiSeq 2000</a>, but here is what I do know: the system will produce 100 Gb of alignable sequence data on two slides per 14-day run. The sequence data will be paired-end, 50&times;35 base reads. Reagent costs for each run will be about $6,000. Since you need about 100 Gb of sequence to sequence a human genome, you&#8217;re looking at about $6,000 in reagent costs per human genome. They also indicated that capacity for the instrument will increase to 300 Gb per run and the cost for reagents per human genome will be less than $3,000 by the end of 2010. In comparison, the Illumina HiSeq 2000 reagent costs will be about $10,000 per human genome at its release with, by <em>my</em> calculations, a path to about $4,000 per human genome (I have no idea what the time frame might be to reach the end of that path, but given this announcement by Life, it will likely be aggressive). You have to love the way competition drives down costs. Just as Illumina announced a big HiSeq 2000 purchase at its launch, Life announced that <a href="http://www.lifetechnologies.com/life-technologies-and-ignite-institute-partner-create-largest-next-generation-genomic-sequencing-fac">Ignite Institute would acquire 100 SOLiD 4 instruments</a> as part of a partnership with Life. Life also announced a major bioinformatics investment program as well as a physician education program through their Foundation.</p>
<p><strong>Update:</strong> According to the press release, Ignite is &#8220;acquiring&#8221;, not purchasing, the instruments in &#8220;partnership&#8221; with Life. So it appears this is not an outright purchase of a large number of instruments. I have updated the text in the post to be more accurate.</p>
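<p>The reagent arithmetic above can be sketched in a few lines of Python. This is only a back-of-the-envelope illustration: the function names are made up for this example, and the 100 Gb per genome default is the rough requirement quoted above, not a vendor specification.</p>

```python
def reagent_cost_per_genome(run_cost_usd, run_yield_gb, gb_per_genome=100.0):
    """Reagent cost of generating one genome's worth of sequence."""
    genomes_per_run = run_yield_gb / gb_per_genome
    return run_cost_usd / genomes_per_run


def max_run_cost(target_cost_per_genome_usd, run_yield_gb, gb_per_genome=100.0):
    """Largest per-run reagent bill that still meets a per-genome cost target."""
    genomes_per_run = run_yield_gb / gb_per_genome
    return target_cost_per_genome_usd * genomes_per_run


# SOLiD 4 at launch: about $6,000 of reagents for a 100 Gb run.
launch = reagent_cost_per_genome(6000, 100)
print(f"At launch: ${launch:,.0f} per genome")

# Announced target: under $3,000 per genome once a run yields 300 Gb.
ceiling = max_run_cost(3000, 300)
print(f"Run reagents must stay under ${ceiling:,.0f} at 300 Gb per run")
```

<p>Read the other way, a genome for less than $3,000 on a 300 Gb run only requires that reagents for the whole run stay under about $9,000.</p>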
]]></content:encoded>
			<wfw:commentRss>http://www.politigenomics.com/2010/01/life-finds-a-way.html/feed</wfw:commentRss>
		<slash:comments>7</slash:comments>
		</item>
		<item>
		<title>Next-Generation Sequencing Informatics table update</title>
		<link>http://www.politigenomics.com/2009/10/next-generation-sequencing-informatics-table-update.html</link>
		<comments>http://www.politigenomics.com/2009/10/next-generation-sequencing-informatics-table-update.html#comments</comments>
		<pubDate>Mon, 05 Oct 2009 14:45:02 +0000</pubDate>
		<dc:creator>dd</dc:creator>
				<category><![CDATA[genomics]]></category>
		<category><![CDATA[IT]]></category>
		<category><![CDATA[454]]></category>
		<category><![CDATA[compute]]></category>
		<category><![CDATA[Illumina]]></category>
		<category><![CDATA[informatics]]></category>
		<category><![CDATA[SOLiD]]></category>
		<category><![CDATA[storage]]></category>

		<guid isPermaLink="false">http://www.politigenomics.com/?p=1606</guid>
		<description><![CDATA[I have made some updates to the Next-Generation Sequencing Informatics table. Specifically, I have updated the numbers for 454 Ti, including paired-end information, and added information on the Illumina GA IIx. If anyone that is not employed by AB has real-world numbers for SOLiD 3, I&#8217;d appreciate you passing them along to me (I&#8217;m looking [...]]]></description>
			<content:encoded><![CDATA[<p>I have made some updates to the <a href="http://www.politigenomics.com/next-generation-sequencing-informatics">Next-Generation Sequencing Informatics table</a>. Specifically, I have updated the numbers for 454 Ti, including paired-end information, and added information on the Illumina GA IIx. If anyone that is not employed by AB has real-world numbers for SOLiD 3, I&#8217;d appreciate you passing them along to me (I&#8217;m looking at you drd).</p>
<p><strong>Update:</strong> I received some SOLiD 3 numbers from Nicholas Socci (thanks Nicholas!).</p>
<p><strong>Update2:</strong> I received a fuller set of numbers from drd and the SOLiD 3 column is complete (thanks drd!).</p>
]]></content:encoded>
			<wfw:commentRss>http://www.politigenomics.com/2009/10/next-generation-sequencing-informatics-table-update.html/feed</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>My secret past</title>
		<link>http://www.politigenomics.com/2009/09/my-secret-past.html</link>
		<comments>http://www.politigenomics.com/2009/09/my-secret-past.html#comments</comments>
		<pubDate>Wed, 16 Sep 2009 15:53:01 +0000</pubDate>
		<dc:creator>dd</dc:creator>
				<category><![CDATA[genomics]]></category>
		<category><![CDATA[IT]]></category>
		<category><![CDATA[Illumina]]></category>
		<category><![CDATA[informatics]]></category>
		<category><![CDATA[SOLiD]]></category>
		<category><![CDATA[wustl]]></category>

		<guid isPermaLink="false">http://www.politigenomics.com/?p=1537</guid>
		<description><![CDATA[Now everyone will know about my secret past before I joined The Genome Center: David Dooling: Gangbusters at the Genome Center. Bio-IT World also has a nice interview with Clive Brown of Oxford Nanopore, whom I first described as the most honest guy in all of next-gen sequencing. By the way, sorry for the extended [...]]]></description>
			<content:encoded><![CDATA[<p>Now everyone will know about my secret past before I joined The Genome Center: <a href="http://www.bio-itworld.com/2009/09/16/NGS-dooling.html">David Dooling: Gangbusters at the Genome Center</a>. Bio-IT World also has a nice <a href="http://www.bio-itworld.com/NGS-Brown.html">interview with Clive Brown</a> of <a href="http://www.nanoporetech.com/">Oxford Nanopore</a>, whom <em>I</em> first described as the <a href="http://www.politigenomics.com/2009/08/another-rich-white-guy-sequences-own-genome.html">most honest guy in all of next-gen sequencing</a>.</p>
<p>By the way, sorry for the extended absence, things have been crazy.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.politigenomics.com/2009/09/my-secret-past.html/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Sour grapes</title>
		<link>http://www.politigenomics.com/2009/08/sour-grapes.html</link>
		<comments>http://www.politigenomics.com/2009/08/sour-grapes.html#comments</comments>
		<pubDate>Mon, 10 Aug 2009 15:13:13 +0000</pubDate>
		<dc:creator>dd</dc:creator>
				<category><![CDATA[genomics]]></category>
		<category><![CDATA[politics]]></category>
		<category><![CDATA[Illumina]]></category>
		<category><![CDATA[SOLiD]]></category>

		<guid isPermaLink="false">http://www.politigenomics.com/?p=1425</guid>
		<description><![CDATA[Well, the US is not the only place with interesting politics. I recently came across this letter from Kevin McKernan, Senior Director of Scientific Operations at Applied Biosystems/Life Technologies, to the House of Lords in the UK (pdf). In the letter, McKernan expresses his concern that the Sanger Institute&#8217;s decision to return their SOLiD instruments [...]]]></description>
			<content:encoded><![CDATA[<p>Well, the US is not the only place with interesting politics. I recently came across this <a href="http://www.parliament.uk/documents/upload/101stGMAppliedBiosystems.pdf">letter from Kevin McKernan, Senior Director of Scientific Operations at Applied Biosystems/Life Technologies, to the House of Lords in the UK (pdf)</a>. In the letter, McKernan expresses his concern that the <a href="http://www.sanger.ac.uk/">Sanger Institute</a>&#8217;s decision to <a href="http://www.genomeweb.com/sequencing/sanger-institute-returns-five-solids-life-technologies">return their SOLiD instruments</a> stemmed from some long-standing resentment of Applied Biosystems over their association with Craig Venter and his challenge to the <a href="http://www.genome.gov/10001772">Human Genome Project</a>. Obviously there could be no valid scientific reason for their actions. And clearly the House of Lords is in the best position to establish that fact and rectify the situation. Sure, the Sanger Institute receives its funding from the <a href="http://www.wellcome.ac.uk/">Wellcome Trust</a>, an <a href="http://www.wellcome.ac.uk/About-us/index.htm">independent charity</a>, but even if the House of Lords can&#8217;t pull their funding, they can always push an antitrust investigation, right?</p>
]]></content:encoded>
			<wfw:commentRss>http://www.politigenomics.com/2009/08/sour-grapes.html/feed</wfw:commentRss>
		<slash:comments>7</slash:comments>
		</item>
		<item>
		<title>Sequencing — the past, present, and future</title>
		<link>http://www.politigenomics.com/2009/04/sequencing-%e2%80%94-the-past-present-and-future.html</link>
		<comments>http://www.politigenomics.com/2009/04/sequencing-%e2%80%94-the-past-present-and-future.html#comments</comments>
		<pubDate>Tue, 21 Apr 2009 19:49:21 +0000</pubDate>
		<dc:creator>dd</dc:creator>
				<category><![CDATA[genomics]]></category>
		<category><![CDATA[454]]></category>
		<category><![CDATA[health]]></category>
		<category><![CDATA[Illumina]]></category>
		<category><![CDATA[PacBio]]></category>
		<category><![CDATA[SOLiD]]></category>
		<category><![CDATA[wustl]]></category>

		<guid isPermaLink="false">http://www.politigenomics.com/?p=1063</guid>
		<description><![CDATA[Science Magazine has a nice article, Sanger Who? Sequencing the Next Generation, describing past sequencing technology, the current &#8220;next-generation&#8221; sequencing instruments and their capabilities, and several of the companies working to become the next big thing in sequencing. If you are interested in learning, at a high level, how each of the technologies works and [...]]]></description>
			<content:encoded><![CDATA[<p>Science Magazine has a nice article, <a href="http://www.sciencemag.org/products/lst_20090410.dtl">Sanger Who? Sequencing the Next Generation</a>, describing past sequencing technology, the current &#8220;next-generation&#8221; sequencing instruments and their capabilities, and several of the companies working to become the next big thing in sequencing. If you are interested in learning, at a high level, how each of the technologies works and how they compare to each other, it is worth a read.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.politigenomics.com/2009/04/sequencing-%e2%80%94-the-past-present-and-future.html/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Next-Generation Sequencing Informatics</title>
		<link>http://www.politigenomics.com/2008/12/next-generation-sequencing-informatics.html</link>
		<comments>http://www.politigenomics.com/2008/12/next-generation-sequencing-informatics.html#comments</comments>
		<pubDate>Thu, 04 Dec 2008 22:15:59 +0000</pubDate>
		<dc:creator>dd</dc:creator>
				<category><![CDATA[genomics]]></category>
		<category><![CDATA[IT]]></category>
		<category><![CDATA[454]]></category>
		<category><![CDATA[compute]]></category>
		<category><![CDATA[Illumina]]></category>
		<category><![CDATA[informatics]]></category>
		<category><![CDATA[SOLiD]]></category>
		<category><![CDATA[storage]]></category>

		<guid isPermaLink="false">http://www.politigenomics.com/?p=660</guid>
		<description><![CDATA[I have put together a table with a bunch of important metrics for the major next-generation sequencing platforms: Next-Generation Sequencing Informatics (there is also a link on the left-hand side of the page). It includes number of reads, read length, data sizes, computational time, etc. I will try to keep it as up to date [...]]]></description>
			<content:encoded><![CDATA[<p>I have put together a table with a bunch of important metrics for the major next-generation sequencing platforms: <a href="http://www.politigenomics.com/next-generation-sequencing-informatics">Next-Generation Sequencing Informatics</a> (there is also a link on the left-hand side of the page). It includes number of reads, read length, data sizes, computational time, etc. I will try to keep it as up to date as I can and add new platforms and revisions as they become available. Consider it an early Christmas present.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.politigenomics.com/2008/12/next-generation-sequencing-informatics.html/feed</wfw:commentRss>
		<slash:comments>4</slash:comments>
		</item>
		<item>
		<title>SOLiD2SRF</title>
		<link>http://www.politigenomics.com/2008/09/solid2srf.html</link>
		<comments>http://www.politigenomics.com/2008/09/solid2srf.html#comments</comments>
		<pubDate>Thu, 18 Sep 2008 16:57:10 +0000</pubDate>
		<dc:creator>dd</dc:creator>
				<category><![CDATA[genomics]]></category>
		<category><![CDATA[FLOSS]]></category>
		<category><![CDATA[Illumina]]></category>
		<category><![CDATA[informatics]]></category>
		<category><![CDATA[SOLiD]]></category>

		<guid isPermaLink="false">http://www.politigenomics.com/?p=307</guid>
		<description><![CDATA[In a previous post, I discussed the sequence read format (SRF), the implementation of the SRF specification for Illumina/Solexa data, and potential implementations for Roche/454 and AB SOLiD data. Well, the potential AB SOLiD implementation has become an actual implementation with the release of solid2srf version 0.6.6. The software is released under the Applied Biosystems [...]]]></description>
			<content:encoded><![CDATA[<p>In a previous post, I discussed the <a href="http://www.politigenomics.com/2008/06/whats-in-an-srf.html">sequence read format (SRF)</a>, the implementation of the <a href="http://srf.sourceforge.net/">SRF specification</a> for <a href="https://sourceforge.net/project/showfiles.php?group_id=100316&#038;package_id=108243">Illumina/Solexa data</a>, and potential implementations for Roche/454 and AB SOLiD data. Well, the potential AB SOLiD implementation has become an actual implementation with the release of <a href="http://solidsoftwaretools.com/gf/project/srf/">solid2srf</a> version 0.6.6. The software is released under the <a href="http://download.solidsoftwaretools.com/license/SIPP_LICENSE.pdf">Applied Biosystems SOLiD&trade; Tools Software License (Unsupported) [pdf]</a>, which appears to be an open source license. Quoting from the license:</p>
<blockquote><p>You may reproduce and distribute copies of the Software or derivative works thereof in any medium, with or without modifications, and in source or object form. You may add your own copyright statement to your modifications and may provide additional or different license terms and conditions for use, reproduction, or distribution of your modifications, or for any such derivative works as a whole, provided your use, reproduction, and distribution of the Software otherwise complies with the conditions stated in this Agreement.</p></blockquote>
<p> Despite the open source license, you do have to register on their GForge site to download the software.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.politigenomics.com/2008/09/solid2srf.html/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>What&#8217;s in an SRF?</title>
		<link>http://www.politigenomics.com/2008/06/whats-in-an-srf.html</link>
		<comments>http://www.politigenomics.com/2008/06/whats-in-an-srf.html#comments</comments>
		<pubDate>Mon, 30 Jun 2008 21:21:20 +0000</pubDate>
		<dc:creator>dd</dc:creator>
				<category><![CDATA[genomics]]></category>
		<category><![CDATA[1000 Genomes]]></category>
		<category><![CDATA[454]]></category>
		<category><![CDATA[Illumina]]></category>
		<category><![CDATA[informatics]]></category>
		<category><![CDATA[SOLiD]]></category>

		<guid isPermaLink="false">http://www.politigenomics.com/?p=100</guid>
		<description><![CDATA[I have written a bit about the NCBI Short Read Archive (SRA), its internals, and data transfer rates. Here is some information about the data format people are using to submit data from the massively parallel sequencers to the SRA. I apologize in advance for all the acronyms. The SRA is currently accepting 454 data [...]]]></description>
			<content:encoded><![CDATA[<p>I have written a bit about the <a href="http://www.ncbi.nlm.nih.gov/Traces/sra/sra.cgi">NCBI Short Read Archive (SRA)</a>, <a href="http://www.politigenomics.com/2008/05/short-read-archive.html">its internals</a>, and <a href="http://www.politigenomics.com/2008/06/how-fast.html">data transfer rates</a>. Here is some information about the data format people are using to submit data from the massively parallel sequencers to the SRA. I apologize in advance for all the acronyms.</p>
<p>The SRA is currently accepting 454 data in <a href="http://www.454.com/news-events/press-releases.asp?display=detail&#038;id=48">standard flowgram format (SFF)</a> and Solexa in <a href="http://srf.sourceforge.net/">SRF</a> format.  Soon 454 and AB SOLiD will support the SRF format and submissions will commence in that format for those platforms.  The SFF format contains the flowgrams (intensity per cycle at each spot), base calls, and base quality values.  In other words, the SFF is very similar to the SCF format used for capillary sequencing data (except flowgrams are discrete whereas chromatograms are continuous).  Also, NCBI (as <a href="http://www.politigenomics.com/2008/05/short-read-archive.html">recently discussed</a>) has developed their own storage format for massively parallel sequencing data that they will also be accepting as a submission format within the next few months.</p>
<p>So what is an SRF? Well, it is basically just a container format, i.e., what you store in it is up to the implementation.  Thus far, SRF has only been implemented for Illumina/Solexa data; so the rest of this post is specific to that platform and the data types that its implementation of the SRF format contains. The Solexa SRF implementation was done largely by James Bonfield at <a href="http://www.sanger.ac.uk/">Sanger</a> and is distributed as part of the <a href="https://sourceforge.net/project/showfiles.php?group_id=100316&#038;package_id=108243">io_lib</a> package (now distributed separately from the <a href="http://staden.sourceforge.net/">Staden package</a>).  I would imagine that the SOLiD implementation will be very similar to the Solexa implementation.  The 454 implementation will likely be very similar to the SFF already in wide use.</p>
<p>For the <a href="http://www.1000genomes.org/">1000 Genomes</a> <a href="http://www.politigenomics.com/2008/03/1000-genomes.html">pilot projects</a>, the 1000 Genomes Data Collection Center (DCC) is asking that we submit the &#8220;raw&#8221;, &#8220;processed&#8221;, and &#8220;base&#8221; data for each spot.  Raw data are the intensity (int) and noise (nse) values.  Processed data are the processed intensity values (sig2) and four-channel quality values (prb).  Base data are the base calls (the quality value is taken from the prb for the called base).  This results in about 50 bytes per base for the SRF. Compared to 2 bits per base, the minimum possible for DNA&#8217;s four-letter alphabet, this is a 200-fold increase.  So not only do these instruments generate a lot more data, we are storing more information per base now too.  The average submission for a Solexa run is about 100 GB.</p>
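<p>The storage overhead quoted above is easy to check. The snippet below is a small sketch, assuming decimal gigabytes (1 GB = 10^9 bytes); the function names are invented for this illustration and are not part of io_lib or any SRF tool.</p>

```python
BITS_PER_BYTE = 8


def fold_increase(srf_bytes_per_base=50, minimal_bits_per_base=2):
    """How much larger the SRF record is than a bare 2-bit base encoding."""
    return srf_bytes_per_base * BITS_PER_BYTE / minimal_bits_per_base


def bases_in_submission(submission_gb=100, srf_bytes_per_base=50):
    """Approximate number of bases behind a submission of the given size."""
    # Assumes decimal units: 1 GB = 1e9 bytes.
    return submission_gb * 1e9 / srf_bytes_per_base


print(f"Overhead: {fold_increase():.0f}-fold")
print(f"Bases in a 100 GB submission: {bases_in_submission():.1e}")
```

<p>At 50 bytes per base, a 100 GB submission therefore corresponds to roughly two billion bases, which at 2 bits per base would fit in about half a gigabyte.</p>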
<p>Why store all this extra information?  Essentially, people do not trust/believe the data at this point.  The quality values provided by these pipelines are not as reliable as those generated for capillary sequence data.  Some people want the raw data so that they can develop and improve base calling/quality algorithms. Clearly you would not need <em>all</em> the 1000 Genomes data to develop such algorithms (although the technology changes at such a rate that you would likely want some rolling subset of the latest runs). Others want the raw data because they think they may want to go back and re-analyze data when better algorithms become available. For a wide variety of reasons (disk space, computational cost, network bandwidth, keeping pace with newly generated data), I doubt any such massive re-analysis will ever take place.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.politigenomics.com/2008/06/whats-in-an-srf.html/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>How much?</title>
		<link>http://www.politigenomics.com/2008/06/how-much.html</link>
		<comments>http://www.politigenomics.com/2008/06/how-much.html#comments</comments>
		<pubDate>Fri, 20 Jun 2008 18:11:06 +0000</pubDate>
		<dc:creator>dd</dc:creator>
				<category><![CDATA[genomics]]></category>
		<category><![CDATA[454]]></category>
		<category><![CDATA[Illumina]]></category>
		<category><![CDATA[informatics]]></category>
		<category><![CDATA[science]]></category>
		<category><![CDATA[SOLiD]]></category>
		<category><![CDATA[wustl]]></category>

		<guid isPermaLink="false">http://www.politigenomics.com/?p=97</guid>
		<description><![CDATA[The Genome Center recently published a paper entitled Aspects of coverage in medical DNA sequencing that develops a model for diploid sequence coverage using data from massively parallel sequencing platforms (454, Solexa, SOLiD). It uses a known yardstick, 8&#215; BAC or WGS coverage with capillary sequencing, to establish the equivalent coverage for the new sequencing [...]]]></description>
			<content:encoded><![CDATA[<p><a href="http://genome.wustl.edu/">The Genome Center</a> recently published a paper entitled <a href="http://www.biomedcentral.com/1471-2105/9/239">Aspects of coverage in medical DNA sequencing</a> that develops a model for diploid sequence coverage using data from massively parallel sequencing platforms (454, Solexa, SOLiD). It uses a known yardstick, 8&times; BAC or WGS coverage with capillary sequencing, to establish the equivalent coverage for the new sequencing platforms. It turns out you need about 20&times; to 30&times; redundancy using these new platforms to obtain the equivalent amount of information as 10&times; coverage with capillary sequencing. The paper is published in an open access journal, <a href="http://www.biomedcentral.com/bmcbioinformatics">BMC Bioinformatics</a>, so enjoy.</p>
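<p>For a rough feel of why redundancy matters, here is a minimal sketch using the textbook Poisson approximation for read depth. To be clear, this is not the diploid model developed in the paper; it is the simpler haploid approximation, and the 3 Gb genome size and the function names are assumptions made for this example.</p>

```python
import math


def poisson_cdf(k, mean):
    """Probability that a Poisson(mean) count is at most k."""
    return sum(math.exp(-mean) * mean**i / math.factorial(i) for i in range(k + 1))


def fraction_covered(mean_coverage, min_depth):
    """Expected fraction of bases sequenced at least min_depth times."""
    return 1.0 - poisson_cdf(min_depth - 1, mean_coverage)


def bases_needed(genome_size_bases, mean_coverage):
    """Total sequence required to reach a given mean redundancy."""
    return genome_size_bases * mean_coverage


HUMAN_GENOME_BASES = 3e9  # assumed round figure for a haploid human genome

for coverage in (8, 20, 30):
    total_gb = bases_needed(HUMAN_GENOME_BASES, coverage) / 1e9
    deep = fraction_covered(coverage, 10)
    print(f"{coverage}x: {total_gb:.0f} Gb total, {deep:.3f} of bases at depth 10 or more")
```

<p>Under these assumptions, 30&times; redundancy on a human genome comes to about 90 Gb of sequence.</p>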
]]></content:encoded>
			<wfw:commentRss>http://www.politigenomics.com/2008/06/how-much.html/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>N Genomes</title>
		<link>http://www.politigenomics.com/2008/05/n-genomes.html</link>
		<comments>http://www.politigenomics.com/2008/05/n-genomes.html#comments</comments>
		<pubDate>Fri, 09 May 2008 20:21:23 +0000</pubDate>
		<dc:creator>dd</dc:creator>
				<category><![CDATA[genomics]]></category>
		<category><![CDATA[1000 Genomes]]></category>
		<category><![CDATA[454]]></category>
		<category><![CDATA[CSHL]]></category>
		<category><![CDATA[health]]></category>
		<category><![CDATA[Illumina]]></category>
		<category><![CDATA[informatics]]></category>
		<category><![CDATA[IT]]></category>
		<category><![CDATA[science]]></category>
		<category><![CDATA[SOLiD]]></category>
		<category><![CDATA[storage]]></category>

		<guid isPermaLink="false">http://www.politigenomics.com/?p=74</guid>
		<description><![CDATA[Earlier this week there were several meetings about the 1000 Genomes Project at Cold Spring Harbor Labs. The first meeting Monday morning was about data flow and data repositories. NCBI&#8217;s Short Read Archive (SRA) and the equivalent at EBI (which should be ready in a month or two) will house all the data. The pilot [...]]]></description>
			<content:encoded><![CDATA[<p>Earlier this week there were several meetings about the <a href="http://www.1000genomes.org/">1000 Genomes Project</a> at <a href="http://www.cshl.edu/">Cold Spring Harbor Labs</a>.  The first meeting Monday morning was about data flow and data repositories.  NCBI&#8217;s <a href="http://www.ncbi.nlm.nih.gov/Traces/sra/sra.cgi">Short Read Archive (SRA)</a> and the equivalent at <a href="http://www.ebi.ac.uk/">EBI</a> (which should be ready in a month or two) will house all the data.  The <a href="http://www.politigenomics.com/2008/03/1000-genomes.html">pilot projects</a> for the 1000 Genomes Project started less than two months ago and have already generated as much sequence data as half of the entire <a href="http://www.ncbi.nlm.nih.gov/Traces/trace.cgi">trace archive</a> (which contains the sequence data for all publicly funded genome projects over the last 10 years).  In other words, this project is going to generate a <span style="font-weight:bold;">lot</span> of sequence data (not to mention all the data generated by analysis of the sequence).  Paul Flicek from EBI estimates the pilot projects alone will generate about 1 PB (1,000,000 GB) of sequence data.  Moving that much data from site to site will be a challenge.  Normal solutions, e.g., FTP, rsync, and shipping hard drives, can&#8217;t seem to keep up with the data generation rates.  NCBI, EBI, and the sequencing centers are testing a high-speed data transfer solution called <a href="http://www.asperasoft.com/products/scp/index.html">Aspera scp</a>.  It has impressive transfer rates, but seems to stall after a while for no discernible reason.  We&#8217;ll see if we can get it to work reliably over the coming weeks.</p>
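<p>To put the transfer problem in perspective, here is a quick sketch of how long 1,000,000 GB takes to move over a sustained network link. The link speeds and the efficiency parameter are hypothetical values chosen for illustration, not measurements from any of the sites involved.</p>

```python
SECONDS_PER_DAY = 86400


def transfer_days(data_gb, link_gbit_per_s, efficiency=1.0):
    """Days needed to move data_gb gigabytes over a link of the given speed.

    efficiency is the fraction of the nominal link rate actually sustained.
    """
    bits = data_gb * 1e9 * 8
    seconds = bits / (link_gbit_per_s * 1e9 * efficiency)
    return seconds / SECONDS_PER_DAY


PILOT_DATA_GB = 1_000_000  # the estimate quoted above

for gbit in (1, 10):  # hypothetical sustained link speeds
    days = transfer_days(PILOT_DATA_GB, gbit)
    print(f"{gbit} Gbit/s: {days:.1f} days")
```

<p>Even a fully saturated 1 Gbit/s link would need about three months, which is why stalls and retries matter so much at this scale.</p>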
<p>After the data flow meeting was a meeting of the 1000 Genomes Steering Committee.  The day and a half that ensued was filled with a lot of lively discussion.  When all was said and done, one thing was clear: there are a lot of questions that need to be answered.  The analysis group presented convincing results from simulations that indicated 2× coverage in a large number of individual genomes (Pilot 1) is probably not sufficient to detect the rare variants the project is going after (present in 1-2% of the population).  The simulations indicated that the power of the study to detect such variants (at a constant cost, i.e., constant total amount of sequence generated) would be greatly enhanced by sequencing half as many people at 4× coverage.  There was no firm decision on how to change the pilot (if at all), but going forward it is likely that some of the individuals in Pilot 1 will be sequenced up to 4× or even 8×.  Thus, while the project may be named 1000 Genomes, exactly how many genomes we are going to sequence is yet to be determined.</p>
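<p>The constant-cost trade-off can be illustrated with a toy calculation. This does not reproduce the analysis group&#8217;s simulations; it assumes read depth is Poisson distributed, that a heterozygous carrier shows the variant allele on half of its reads, and that a variant counts as detected only when at least two reads carry it. All names and thresholds are assumptions made for this sketch.</p>

```python
import math


def samples_at_constant_cost(total_fold_budget, coverage):
    """Number of genomes affordable when total sequence is held fixed."""
    return total_fold_budget / coverage


def p_allele_seen(coverage, min_reads=2):
    """Chance that a heterozygous carrier shows the variant on min_reads or more reads.

    Variant-allele depth is modeled as Poisson with mean coverage / 2.
    """
    mean = coverage / 2.0
    below = sum(math.exp(-mean) * mean**i / math.factorial(i) for i in range(min_reads))
    return 1.0 - below


BUDGET = 2000  # hypothetical budget: 1000 genomes at 2x, in genome-equivalents

for coverage in (2, 4, 8):
    n = samples_at_constant_cost(BUDGET, coverage)
    p = p_allele_seen(coverage)
    print(f"{coverage}x: {n:.0f} genomes, per-carrier detection chance {p:.2f}")
```

<p>In this toy model, doubling coverage from 2&times; to 4&times; raises the per-carrier detection chance from roughly 0.26 to roughly 0.59, which more than offsets halving the number of people sampled. Whether that holds for real data depends on error rates and calling methods that this sketch ignores.</p>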
<p>Another issue that arose was the rapid development of the massively parallel sequencing technologies.  These platforms (454 FLX, Illumina Genome Analyzer, and AB SOLiD) increase their throughput, improve their data quality, improve analysis software, etc. several times each year.  Such dynamic platforms make the development of tools to analyze their data, e.g., align the data to a reference genome and detect variants, very difficult.  The right platforms and tools today may not be the best next month or next year when the main project gets underway.  This causes two major needs to come to the fore.  First, experimental design will not end when the project starts.  The experiment will need to be adjusted as capabilities and capacities change.  Second, we will not only have to continually develop and refine tools throughout the project, we will need to develop frameworks to continually test and compare the tools that are available.  It&#8217;s always fun to hit a moving target.</p>
<p>The meeting also discussed the ethical, legal, and social implications (ELSI) of the project.  This discussion largely focused on which populations to sample for the project.  Should we deepen our knowledge of individuals of Central European, African, and East Asian ancestry to aid in methodology development?  Or should we broaden our knowledge of overall human variation by including fewer individuals from a larger number of populations?  To be determined&hellip;</p>
]]></content:encoded>
			<wfw:commentRss>http://www.politigenomics.com/2008/05/n-genomes.html/feed</wfw:commentRss>
		<slash:comments>6</slash:comments>
		</item>
	</channel>
</rss>

