<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>PolITiGenomics &#187; 1000 Genomes</title>
	<atom:link href="http://www.politigenomics.com/tag/1000-genomes/feed" rel="self" type="application/rss+xml" />
	<link>http://www.politigenomics.com</link>
	<description>Politics, Information Technology, and Genomics</description>
	<lastBuildDate>Thu, 21 Apr 2011 17:49:06 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.2.1</generator>
		<item>
		<title>Breakin&#8217; 3: Genomic Variations</title>
		<link>http://www.politigenomics.com/2009/10/breakin-3-genomic-variations.html</link>
		<comments>http://www.politigenomics.com/2009/10/breakin-3-genomic-variations.html#comments</comments>
		<pubDate>Fri, 02 Oct 2009 15:00:48 +0000</pubDate>
		<dc:creator>dd</dc:creator>
				<category><![CDATA[genomics]]></category>
		<category><![CDATA[1000 Genomes]]></category>
		<category><![CDATA[informatics]]></category>
		<category><![CDATA[science]]></category>
		<category><![CDATA[wustl]]></category>

		<guid isPermaLink="false">http://www.politigenomics.com/?p=1575</guid>
		<description><![CDATA[In the long-awaited follow-up to Breakin&#8217; and Breakin&#8217; 2, Ken Chen has released BreakDancer. As described in his Nature Methods article and a recent Genome Technology article, BreakDancer is not so much a movie as it is a bioinformatics program that can detect structural variation (insertions, deletions, inversions, and translocations) in genomes using [...]]]></description>
			<content:encoded><![CDATA[<p><a href="http://www.imdb.com/title/tt0086999/"><img src="http://www.politigenomics.com/wp-content/uploads/2009/10/breakin2.jpg" alt="Breakin&#039; 2" title="Breakin&#039; 2" width="250" height="328" class="alignright size-full wp-image-1576" /></a></p>
<p>In the long-awaited follow-up to <a href="http://www.imdb.com/title/tt0086998/">Breakin&#8217;</a> and <a href="http://www.imdb.com/title/tt0086999/">Breakin&#8217; 2</a>, <a href="http://genome.wustl.edu/people/chen_ken">Ken Chen</a> has released <a href="http://genome.wustl.edu/tools/cancer-genomics#variant-detection-tools">BreakDancer</a>. As described in his <a href="http://www.nature.com/nmeth/journal/v6/n9/abs/nmeth.1363.html">Nature Methods article</a> and a recent <a href="http://news.google.com/news/url?sa=t&#038;ct2=us%2F0_0_s_0_0_t&#038;usg=AFQjCNEtrzqHx1YXrLeKGIz7QvMHKHu4AQ&#038;cid=0&#038;ei=mgXGSsjNCJXM8ATose9r&#038;rt=SEARCH&#038;vm=STANDARD&#038;url=http%3A%2F%2Fwww.genomeweb.com%2Finformatics%2Fnew-breakdancer-algorithm-performs-high-res-mapping-indels-more">Genome Technology article</a>, BreakDancer is not so much a movie as it is a bioinformatics program that can detect <a href="http://en.wikipedia.org/wiki/Mutation#By_effect_on_structure">structural variation (insertions, deletions, inversions, and translocations) in genomes</a> using paired-end read data. It can be used to detect structural variation in individual genomes, pools of genomes (like the low-coverage <a href="http://www.1000genomes.org/">1000 Genomes</a> data), and tumor/normal samples with greater sensitivity and specificity than other structural variation detection algorithms.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.politigenomics.com/2009/10/breakin-3-genomic-variations.html/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Easy racism</title>
		<link>http://www.politigenomics.com/2009/08/easy-racism.html</link>
		<comments>http://www.politigenomics.com/2009/08/easy-racism.html#comments</comments>
		<pubDate>Wed, 12 Aug 2009 14:53:12 +0000</pubDate>
		<dc:creator>dd</dc:creator>
				<category><![CDATA[genomics]]></category>
		<category><![CDATA[politics]]></category>
		<category><![CDATA[1000 Genomes]]></category>
		<category><![CDATA[Helicos]]></category>
		<category><![CDATA[science]]></category>

		<guid isPermaLink="false">http://www.politigenomics.com/?p=1472</guid>
		<description><![CDATA[Well President Obama and I now have something in common: we have both been accused of being racist against white people. President Obama by Glenn &#8220;&#8216;President Obama has a deep-seated hatred of white people&#8217; and 75 seconds later &#8216;I&#8217;m not saying President Obama doesn&#8217;t like white people&#8217;&#8221; Beck and me by neoprene nancy (not to [...]]]></description>
			<content:encoded><![CDATA[<p>Well President Obama and I now have something in common: we have both been accused of being racist against white people.</p>
<div class="embedvideo"><object width="425" height="344"><param name="movie" value="http://www.youtube.com/v/MI_0Kt_e3Go&#038;hl=en&#038;fs=1&#038;"></param><param name="allowFullScreen" value="true"></param><param name="allowscriptaccess" value="always"></param><embed src="http://www.youtube.com/v/MI_0Kt_e3Go&#038;hl=en&#038;fs=1&#038;" type="application/x-shockwave-flash" allowscriptaccess="always" allowfullscreen="true" width="425" height="344"></embed></object></div>
<p>President Obama by Glenn &#8220;&#8216;President Obama has a deep-seated hatred of white people&#8217; and 75 seconds later &#8216;I&#8217;m not saying President Obama doesn&#8217;t like white people&#8217;&#8221; Beck and me by <em>neoprene nancy</em> (not to be confused with <a href="http://www.beatlestube.net/video.php?title=Polythene%20Pam">Polythene Pam</a>) in a <a href="http://www.politigenomics.com/2009/08/another-rich-white-guy-sequences-own-genome.html?comment-15315">comment</a> on my recent post <a href="http://www.politigenomics.com/2009/08/another-rich-white-guy-sequences-own-genome.html">Another rich white guy sequences own genome</a> about Stephen Quake sequencing his own genome. <a href="http://www.tweakguides.com/Quake4_1.html"><img src="http://www.politigenomics.com/wp-content/uploads/2009/08/quake.jpg" alt="Quake" title="Quake" width="200" height="160" class="alignright size-full wp-image-1476" /></a> The entirety of her comment is &#8220;<a href="http://www.politigenomics.com/2009/08/another-rich-white-guy-sequences-own-genome.html?comment-15315">Another blogging guy inadvertently reveals racism.</a>&#8221; I truly find it a strange and discomforting thing that people so quickly and easily throw out accusations of racism. It is almost as though racists are adopting the approach taken by hip hop artists with the N-word: adopt the word&#8217;s use to the point of overuse to diminish its derogatory connotation. Hopefully <em>neoprene nancy</em> was joking. In case she was not, let&#8217;s take the accusation apart. Since she does not call me socialist or sexist and herself uses the term &#8220;guy&#8221;, I assume she is only concerned with my reference to &#8220;white&#8221; and not &#8220;rich&#8221; or &#8220;guy&#8221;. 
Perhaps, as Glenn Beck would have us believe about President Obama, I am a self-loather and have a deep-seated hatred of white people (for those of you wondering, you can see a <a href="http://www.politigenomics.com/about">picture of me</a>). Or perhaps my dissatisfaction with the selection of another healthy, white male for whole genome sequencing has to do with the fact that it will provide essentially no scientific value. This is likely why the sequence analysis was so cursory and why it got published as a <a href="http://www.nature.com/nbt/journal/vaop/ncurrent/abs/nbt.1561.html">letter in Nature Biotechnology</a> rather than an article in a top-tier journal: it is essentially just a proof of concept for the <a href="http://www.helicosbio.com/">Helicos</a> platform. While knowing his sequence has provided some health insights for Stephen Quake (see <a href="http://www.bio-itworld.com/news/08/10/09/stephen-quake-personal-genome-single-molecule-sequencing.html">Quake Traits</a>), it does not advance our understanding of the relationship between genotype (DNA) and phenotype (traits). It does not add to our understanding of natural human variation beyond that provided by the <a href="http://www.1000genomes.org/">1000 Genomes Project</a>. It does not help us to understand variants associated with cancer or other common diseases. It does not help us to interpret the biological role of conserved sequences in the genome that are not in genes. In short, it is a novelty and should be called out for what it is, another rich, white guy sequencing his own genome.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.politigenomics.com/2009/08/easy-racism.html/feed</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>The evil that scientists do</title>
		<link>http://www.politigenomics.com/2009/06/the-evil-that-scientists-do.html</link>
		<comments>http://www.politigenomics.com/2009/06/the-evil-that-scientists-do.html#comments</comments>
		<pubDate>Wed, 10 Jun 2009 18:29:25 +0000</pubDate>
		<dc:creator>dd</dc:creator>
				<category><![CDATA[genomics]]></category>
		<category><![CDATA[politics]]></category>
		<category><![CDATA[1000 Genomes]]></category>
		<category><![CDATA[health]]></category>
		<category><![CDATA[science]]></category>

		<guid isPermaLink="false">http://www.politigenomics.com/?p=1214</guid>
		<description><![CDATA[By now it is well known that scientists are horrible people. It is clear from reports like that found last month in PLoS ONE that all scientists (OK, well more than 72%) are lying data falsifiers with questionable morals. This week we find out from the (n)ever insightful Sharon Begley of Newsweek that researchers are [...]]]></description>
			<content:encoded><![CDATA[<p><a href="http://commons.wikimedia.org/wiki/File:Mad_scientist_transparent_background.svg"><img src="http://www.politigenomics.com/wp-content/uploads/2009/06/mad_scientist.png" alt="mad scientist" title="mad scientist" width="222" height="213" class="alignright size-full wp-image-1225" /></a></p>
<p>By now it is well known that scientists are horrible people. It is clear from reports like that found last month in PLoS ONE that all scientists (OK, well more than 72%) are <a href="http://www.plosone.org/article/info%3Adoi%2F10.1371%2Fjournal.pone.0005738">lying data falsifiers with questionable morals</a>. This week we find out from the (n)ever insightful <a href="http://www.newsweek.com/id/32249">Sharon Begley</a> of Newsweek that <a href="http://www.newsweek.com/id/200599">researchers are withholding life-saving information to further their careers</a>. The selfish beasts! Just think of all the children whose lives could be saved if it weren&#8217;t for these terrible scientists. Certainly the few anecdotes she presents are indicative of the entire reality faced by scientists and medical practitioners and not just sour grapes from researchers with an inflated sense of the importance of their research. Certainly these &#8220;top tier&#8221; journals are publishing complete rubbish instead of these seminal papers. Everyone knows that tenure decisions are based solely on the number of publications a PI has in Nature. It has nothing to do with how much grant money (and therefore overhead to the institution) the PI brings in.</p>
<p>Of course these mass murderers must be stopped and Ms. Begley says the new head of NIH is just the one to do it. As many readers of this blog will undoubtedly know, <a href="http://www.genome.gov/10000779">Francis Collins</a>, former head of the <a href="http://www.genome.gov/">National Human Genome Research Institute (NHGRI)</a>, has been rumored to be on the <a href="http://www.latimes.com/features/health/la-na-nih-collins23-2009may23,0,5889122.story">short list for NIH director</a>. Now we must all ask ourselves, would Sharon Begley approve? With his penchant for pushing big projects like the <a href="http://www.genome.gov/10001772">Human Genome Project</a> and the <a href="http://www.1000genomes.org/">1000 Genomes Project</a> that have publications in prestigious journals, can he be trusted? Never mind his role in pushing for early release of <a href="http://www.sanger.ac.uk/HGP/policy-forum.shtml">human sequence data</a> and driving <a href="http://www.politigenomics.com/2008/05/gina-becomes-law.html">GINA</a> through the U.S. Congress. Prestige-hounds like Dr. Collins must be stopped!</p>
<p><em>(Yes, I&#8217;m being sarcastic. And, yes, I know publications and grant money are linked, but fairly weakly. Lots more people get grant money than are published in &#8220;top tier&#8221; journals. As Randy Newman sings, &#8220;it&#8217;s money that matters&#8221;.)</em></p>
]]></content:encoded>
			<wfw:commentRss>http://www.politigenomics.com/2009/06/the-evil-that-scientists-do.html/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>1000 Genomes phase change</title>
		<link>http://www.politigenomics.com/2009/05/1000-genomes-phase-change.html</link>
		<comments>http://www.politigenomics.com/2009/05/1000-genomes-phase-change.html#comments</comments>
		<pubDate>Thu, 21 May 2009 15:28:06 +0000</pubDate>
		<dc:creator>dd</dc:creator>
				<category><![CDATA[genomics]]></category>
		<category><![CDATA[1000 Genomes]]></category>
		<category><![CDATA[informatics]]></category>
		<category><![CDATA[wustl]]></category>

		<guid isPermaLink="false">http://www.politigenomics.com/?p=1113</guid>
		<description><![CDATA[The 1000 Genomes Project is transitioning from its pilot phase into the full project. Since 1000 Genomes is a mixture of data production centers and not a pure component, the phase change from pilot to full project is a continuum rather than a sharp transition. Some centers have already started sequencing of the full project [...]]]></description>
			<content:encoded><![CDATA[<p>The <a href="http://www.1000genomes.org/">1000 Genomes Project</a> is transitioning from its pilot phase into the full project. Since 1000 Genomes is a mixture of data production centers and not a pure component, the phase change from pilot to full project is a continuum rather than a sharp transition. Some centers have already started sequencing of the full project, other centers are finishing up their pilot 3 (targeted, capture-based sequencing of 1000 genes) commitments, and some are doing both. The full project will be expanding the populations sequenced beyond the CEU (European), YRI (African), and CHB (Asian) populations sequenced in the pilot projects to include a wider diversity of major populations as well as focus on sub-populations; some samples are still in the process of getting properly consented and collected.</p>
<p>Of course, even though they were just three &#8220;pilot&#8221; projects, they still generated a <em>lot</em> of data and new variants. According to Paul Flicek of EBI, over 3.6 Tb of sequence in over 95 billion sequence reads have been submitted and made available on the <a href="ftp://ftp-trace.ncbi.nih.gov/1000genomes/">1000 Genomes FTP site</a>. For some coverage of the new variants that have been found, check out this <a href="http://www.timesonline.co.uk/tol/news/uk/science/article6314464.ece">Times Online</a> article and the GenomeWeb coverage of <a href="http://www.genomeweb.com/informatics/researchers-uncovering-variants-1000-genomes-pilot-data">Gon&ccedil;alo Abecasis&#8217;s talk</a> at the <a href="http://meetings.cshl.edu/meetings/genome09.shtml">Biology of Genomes</a> meeting a few weeks ago.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.politigenomics.com/2009/05/1000-genomes-phase-change.html/feed</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Paul Flicek&#8217;s a Mac</title>
		<link>http://www.politigenomics.com/2009/05/paul-fliceks-a-mac.html</link>
		<comments>http://www.politigenomics.com/2009/05/paul-fliceks-a-mac.html#comments</comments>
		<pubDate>Mon, 18 May 2009 20:02:33 +0000</pubDate>
		<dc:creator>dd</dc:creator>
				<category><![CDATA[genomics]]></category>
		<category><![CDATA[IT]]></category>
		<category><![CDATA[1000 Genomes]]></category>
		<category><![CDATA[informatics]]></category>

		<guid isPermaLink="false">http://www.politigenomics.com/?p=1103</guid>
		<description><![CDATA[Paul may kill me for posting this, but I&#8217;ll assume he allowed himself to be recorded and the video posted because he wanted people to see it. Enjoy.]]></description>
			<content:encoded><![CDATA[<p>Paul may kill me for posting this, but I&#8217;ll assume he allowed himself to be recorded and the video posted because he <em>wanted</em> people to see it. Enjoy.</p>
<div class="widevideo"><object width="480" height="295"><param name="movie" value="http://www.youtube.com/v/s4ReUHP-2eI&#038;hl=en&#038;fs=1"></param><param name="allowFullScreen" value="true"></param><param name="allowscriptaccess" value="always"></param><embed src="http://www.youtube.com/v/s4ReUHP-2eI&#038;hl=en&#038;fs=1" type="application/x-shockwave-flash" allowscriptaccess="always" allowfullscreen="true" width="480" height="295"></embed></object></div>
]]></content:encoded>
			<wfw:commentRss>http://www.politigenomics.com/2009/05/paul-fliceks-a-mac.html/feed</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>CSHL Biology of Genomes 2009</title>
		<link>http://www.politigenomics.com/2009/04/cshl-biology-of-genomes-2009.html</link>
		<comments>http://www.politigenomics.com/2009/04/cshl-biology-of-genomes-2009.html#comments</comments>
		<pubDate>Mon, 27 Apr 2009 21:50:46 +0000</pubDate>
		<dc:creator>dd</dc:creator>
				<category><![CDATA[genomics]]></category>
		<category><![CDATA[IT]]></category>
		<category><![CDATA[1000 Genomes]]></category>
		<category><![CDATA[compute]]></category>
		<category><![CDATA[CSHL]]></category>
		<category><![CDATA[informatics]]></category>
		<category><![CDATA[science]]></category>
		<category><![CDATA[software]]></category>
		<category><![CDATA[wustl]]></category>

		<guid isPermaLink="false">http://www.politigenomics.com/?p=1059</guid>
		<description><![CDATA[I should have posted this earlier, but things have been pretty busy. In any event, I will be presenting a poster next week at the Biology of Genomes meeting at Cold Spring Harbor. The poster is entitled &#8220;Maximizing utility of genome sequence data&#8221;. Here is the abstract. Advances in DNA sequencing technologies over the past [...]]]></description>
			<content:encoded><![CDATA[<p>I should have posted this earlier, but things have been pretty busy. In any event, I will be presenting a poster next week at the <a href="http://meetings.cshl.edu/meetings/genome09.shtml">Biology of Genomes</a> meeting at <a href="http://www.cshl.edu/">Cold Spring Harbor</a>. The poster is entitled &#8220;Maximizing utility of genome sequence data&#8221;. Here is the abstract.</p>
<blockquote><p>Advances in DNA sequencing technologies over the past few years have led to data generation and processing rates that far outpace Moore&#8217;s Law and storage capacity improvements.  As a result, there will come a time when one will no longer be able to “throw more money” at the problems presented by DNA sequencing, i.e., researchers will not be able to keep pace with data generation by purchasing more and more storage and computational nodes.  Proposed sequencing platform improvements and the rapid rate of adoption of these technologies by labs large and small will only hasten the time when the old solutions will no longer apply.  The history of freely shared sequence data through the NCBI and EBI Trace Archives transforms the very difficult problem of massive sequence data generation into a problem of data generation and data sharing on a scale heretofore unimaginable.  Over the last year, several organizations, e.g., MGED, NCI, Illumina, 1000 Genomes DCC, and NHGRI, have convened meetings to discuss the problems presented by the massive amounts of data generated by next-generation sequencing technologies.  As prologue, brief overviews of these meetings will be presented along with approaches to dealing with massive data generation rates from other disciplines, e.g., high energy physics and high-resolution medical imaging.  The Genome Center at Washington University in St. Louis, due to its large-scale sequencing operation and whole-genome analysis capabilities, experiences the difficulties presented by massively-parallel sequencing platforms acutely.  To address the many challenges presented by the scale of data generation and requisite analysis, we have developed a multidisciplinary approach involving experts in biology, genomics, bioinformatics, computer science, information technology, and engineering.  
The resulting approach involves many techniques including intelligent compression and data reduction, data aging, archiving, parallelization, fault-tolerant workflows, scalable software frameworks, and multivariate/multi-genome visualization and comparison, which leverage and extend our laboratory information management system.  This approach and its application to the sequencing and analysis of cancer samples will be presented.</p></blockquote>
<p> It&#8217;s a lot to cover in 4 ft &times; 4 ft, but I&#8217;ll do my best. If you are going to be at Cold Spring Harbor, stop by and say hello.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.politigenomics.com/2009/04/cshl-biology-of-genomes-2009.html/feed</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
		<item>
		<title>GIA Talk</title>
		<link>http://www.politigenomics.com/2009/03/gia-talk.html</link>
		<comments>http://www.politigenomics.com/2009/03/gia-talk.html#comments</comments>
		<pubDate>Fri, 27 Mar 2009 21:49:20 +0000</pubDate>
		<dc:creator>dd</dc:creator>
				<category><![CDATA[genomics]]></category>
		<category><![CDATA[IT]]></category>
		<category><![CDATA[1000 Genomes]]></category>
		<category><![CDATA[GIA]]></category>
		<category><![CDATA[informatics]]></category>
		<category><![CDATA[science]]></category>

		<guid isPermaLink="false">http://www.politigenomics.com/?p=998</guid>
		<description><![CDATA[Last week was the first Genome Informatics Alliance meeting. It was a meeting of second-generation sequence vendors, users, data repositories, and other high-throughput endeavors, e.g., high-energy physics and Google, to discuss the challenges that second-generation sequencing is creating for bioinformatics. I gave a talk at the meeting to help introduce the people not in genomics [...]]]></description>
			<content:encoded><![CDATA[<p>Last week was the first Genome Informatics Alliance meeting. It was a meeting of second-generation sequence vendors, users, data repositories, and other high-throughput endeavors, e.g., high-energy physics and Google, to discuss the challenges that second-generation sequencing is creating for bioinformatics. I gave a talk at the meeting to help introduce the people not in genomics to the issues we currently face. Yesterday I recreated the talk, recording an audio track (while battling a bit of a cold), then uploaded the talk and audio today. After wrestling a bit with <a href="http://www.slideshare.net/">SlideShare</a>&#8217;s screencasting interface, I ended up with the video below.</p>
<div class="embedvideo">
<div style="width:425px;text-align:left" id="__ss_1210857"><a style="font:14px Helvetica,Arial,Sans-serif;display:block;margin:12px 0 3px 0;text-decoration:underline;" href="http://www.slideshare.net/ddgenome/challenges-with-data-quality-sharing-and-versioning-in-nextgeneration-sequencing?type=powerpoint" title="Challenges with Data Quality, Sharing, and Versioning in Next-Generation Sequencing">Challenges with Data Quality, Sharing, and Versioning in Next-Generation Sequencing</a><object style="margin:0px" width="425" height="355"><param name="movie" value="http://static.slidesharecdn.com/swf/ssplayer2.swf?doc=challenges-090327102109-phpapp01&#038;stripped_title=challenges-with-data-quality-sharing-and-versioning-in-nextgeneration-sequencing" /><param name="allowFullScreen" value="true"/><param name="allowScriptAccess" value="always"/><embed src="http://static.slidesharecdn.com/swf/ssplayer2.swf?doc=challenges-090327102109-phpapp01&#038;stripped_title=challenges-with-data-quality-sharing-and-versioning-in-nextgeneration-sequencing" type="application/x-shockwave-flash" allowscriptaccess="always" allowfullscreen="true" width="425" height="355"></embed></object>
<div style="font-size:11px;font-family:tahoma,arial;height:26px;padding-top:2px;">View more <a style="text-decoration:underline;" href="http://www.slideshare.net/">presentations</a> from <a style="text-decoration:underline;" href="http://www.slideshare.net/ddgenome">David Dooling</a>.</div>
</div>
</div>
<p>I hope you enjoy the video. You may need to turn up the volume on your computer to hear it. I&#8217;ll post more about the meeting soon.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.politigenomics.com/2009/03/gia-talk.html/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>File formats aplenty</title>
		<link>http://www.politigenomics.com/2009/01/file-formats-aplenty.html</link>
		<comments>http://www.politigenomics.com/2009/01/file-formats-aplenty.html#comments</comments>
		<pubDate>Wed, 28 Jan 2009 22:40:52 +0000</pubDate>
		<dc:creator>dd</dc:creator>
				<category><![CDATA[genomics]]></category>
		<category><![CDATA[IT]]></category>
		<category><![CDATA[1000 Genomes]]></category>
		<category><![CDATA[informatics]]></category>

		<guid isPermaLink="false">http://www.politigenomics.com/?p=831</guid>
		<description><![CDATA[In a previous post on the 1000 Genomes Project, David Sexton from the Center for Human Genetics Research at Vanderbilt University asked about the new file formats for alignments, assembly, and genotype data. The alignment and mapping format is called Sequence Alignment/Map or SAM. The specification is available as a PDF and there is also [...]]]></description>
			<content:encoded><![CDATA[<p>In a <a href="http://www.politigenomics.com/2008/05/n-genomes.html">previous post on the 1000 Genomes Project</a>, <a href="http://www.vanderbilt.edu/oor/cores/chgr-computational_genomics.php">David Sexton</a> from the Center for Human Genetics Research at Vanderbilt University asked about the new file formats for alignments, assembly, and genotype data. The alignment and mapping format is called <a href="http://samtools.sourceforge.net/">Sequence Alignment/Map or SAM</a>. The specification is available as a <a href="http://samtools.sourceforge.net/SAM1.pdf">PDF</a> and there is also a C library (<a href="http://samtools.sourceforge.net/samtools/masterTOC.shtml">API</a>) available for working with SAM files. You can <a href="http://sourceforge.net/project/showfiles.php?group_id=246254&#038;package_id=300388">download the C source code and tools for working with SAM files (SAMTools)</a>, including a utility for converting <a href="http://maq.sourceforge.net/">Maq</a> map files to SAM files. Genotype data is being stored in the <a href="http://maq.sourceforge.net/glfProgs.shtml">genotype likelihood format (GLF)</a>. Maq (<code>glfgen</code>) can create GLF files from a map file and the reference sequence. Since many other aligners support output in the Maq map format, this means you can generate GLF files from the output of many aligners. Tools that operate on GLF files, including calling SNPs, are available on the Maq site as <a href="http://sourceforge.net/project/showfiles.php?group_id=191815&#038;package_id=293182">glfProgs</a>. Hopefully this all means some (<em>de facto</em>) standards are arising.</p>
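<p>To give a feel for what a SAM alignment line looks like to a program, here is a minimal Python sketch that splits one tab-delimited record into named fields. The column names follow the 11 mandatory fields of the published SAM specification (early drafts named the mate columns differently). The helper <code>parse_sam_line</code> and the example read are hypothetical, made up for illustration, and are not part of SAMTools.</p>
<pre><code>from collections import namedtuple

# The 11 mandatory SAM columns, in order (names as in the published spec).
SAM_FIELDS = ["qname", "flag", "rname", "pos", "mapq", "cigar",
              "rnext", "pnext", "tlen", "seq", "qual"]
INT_FIELDS = {"flag", "pos", "mapq", "pnext", "tlen"}

SamRecord = namedtuple("SamRecord", SAM_FIELDS + ["tags"])


def parse_sam_line(line):
    """Split one SAM alignment line into a SamRecord (hypothetical helper).

    Header lines (starting with "@") are not handled here; the caller
    is expected to skip them.
    """
    columns = line.rstrip("\n").split("\t")
    mandatory = columns[:len(SAM_FIELDS)]
    if len(mandatory) != len(SAM_FIELDS):
        raise ValueError("expected at least 11 tab-delimited columns")
    values = [int(value) if name in INT_FIELDS else value
              for name, value in zip(SAM_FIELDS, mandatory)]
    # Anything after column 11 is an optional TAG:TYPE:VALUE field.
    return SamRecord(*values, tags=columns[len(SAM_FIELDS):])


if __name__ == "__main__":
    # Made-up record, for illustration only.
    example = "read1\t99\tchr1\t100\t60\t5M\t=\t200\t105\tACGTA\tIIIII\tNM:i:0"
    record = parse_sam_line(example)
    print(record.rname, record.pos, record.cigar, record.tags)
</code></pre>
<p>For real work you would want the C library or SAMTools themselves; the sketch is only meant to show that the text form of the format is simple enough to inspect by hand.</p>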
]]></content:encoded>
			<wfw:commentRss>http://www.politigenomics.com/2009/01/file-formats-aplenty.html/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>1000 Genome SNPs released</title>
		<link>http://www.politigenomics.com/2008/12/1000-genome-snps-released.html</link>
		<comments>http://www.politigenomics.com/2008/12/1000-genome-snps-released.html#comments</comments>
		<pubDate>Wed, 24 Dec 2008 13:49:14 +0000</pubDate>
		<dc:creator>dd</dc:creator>
				<category><![CDATA[genomics]]></category>
		<category><![CDATA[1000 Genomes]]></category>
		<category><![CDATA[informatics]]></category>
		<category><![CDATA[IT]]></category>
		<category><![CDATA[science]]></category>
		<category><![CDATA[wustl]]></category>

		<guid isPermaLink="false">http://www.politigenomics.com/?p=697</guid>
		<description><![CDATA[The 1000 Genomes Project has announced its initial release of SNP data from four of the individuals sequenced to high depth-of-coverage as part of the second pilot project (trios). Here is the announcement from Paul Flicek of EBI and the 1000 Genomes Data Coordination Center (and formerly of Washington University). Dear All, I&#8217;m pleased to [...]]]></description>
			<content:encoded><![CDATA[<p>The <a href="http://www.1000genomes.org/">1000 Genomes Project</a> has announced its initial release of <a href="http://en.wikipedia.org/wiki/Single_nucleotide_polymorphism">SNP</a> data from four of the individuals sequenced to high depth-of-coverage as part of the <a href="http://www.politigenomics.com/2008/03/1000-genomes.html">second pilot project (trios)</a>.  Here is the announcement from Paul Flicek of <a href="http://www.ebi.ac.uk/">EBI</a> and the 1000 Genomes Data Coordination Center (and formerly of Washington University).</p>
<blockquote>
<p>Dear All,</p>
<p>I&#8217;m pleased to provide everyone a stocking stuffer in the form of the first release of data from the 1000 Genomes project.</p>
<p>The preliminary list of SNPs for 4 of the high coverage individuals are now available on the EBI and NCBI 1000 Genomes FTP sites. Instructions on how to access the data can be found at <a href="http://www.1000genomes.org">http://www.1000genomes.org</a>.</p>
<p>In addition, we have created a project specific genome browser to allow the data to be visualised in the context of genome annotations and data from other projects including the Venter and Watson genomes.  The browser is based on the Ensembl platform and is available at <a href="http://browser.1000genomes.org">http://browser.1000genomes.org</a>.  We will be making updates to the browser throughout January to ensure the 1000 Genomes data is visible by default and is easy to find (SNP tracks can now be found on the &#8220;Features&#8221; menu).  I welcome any comments, questions or suggestions that you have about the workings of the browser.</p>
<p>A long list of people worked very hard to get this done and any attempt to mention people will certainly miss some.  However, I would like to specifically acknowledge Tom Blackwell, Goncalo Abecasis, Fiona Hyland, Zam Iqbal, Laura Clarke, Eugene Kulesha, Yuan Chen, Stephen Keenan, Fiona Cunningham, Justin Paschall, Martin Shumway, Hoda Kouri and Steve Sherry.</p>
<p>All the very best for the holiday season.</p>
<p>Paul Flicek
</p></blockquote>
<p>Obviously, the three 1000 Genomes pilot projects have been a massive undertaking that has strained not only the production centers, but the IT and informatics infrastructures of the production and analysis centers. To date, over 3.8 terabases (3.8&times;10<sup>12</sup> or 3.8 trillion bases which is equivalent to about 1270 <a href="http://en.wikipedia.org/wiki/Ploidy">haploid</a> human genomes) have been submitted as part of these pilot projects. The average <a href="http://www.politigenomics.com/2008/06/whats-in-an-srf.html">SRF</a> file submitted to the <a href="http://www.politigenomics.com/2008/05/short-read-archive.html">NCBI SRA</a> stored 50 bytes of information per base; so the amount of data submitted so far is nearly 200 TB! At current broadband rates in the United States, it would take nearly 10 years to download all of this data (those still using 1600 baud modems may want to request they ship you the data on hard drives). Did I mention these are just the <em>pilot</em> projects?</p>
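<p>For anyone who wants to check the back-of-the-envelope numbers, here is a rough Python sketch of the arithmetic. The 3.8 terabases and 50 bytes per base come from the figures above; the haploid genome size of about 3 billion bases and the sustained 5 Mbit/s download rate are my assumptions for illustration, not figures from the project.</p>
<pre><code># Back-of-the-envelope check of the pilot-project data volume.
BASES_SUBMITTED = 3.8e12        # 3.8 terabases, from the text above
BYTES_PER_BASE = 50             # average SRF storage per base, from the text above
HAPLOID_GENOME_BASES = 3.0e9    # assumption: about 3 billion bases
DOWNLOAD_BITS_PER_SECOND = 5e6  # assumption: 5 Mbit/s sustained broadband
SECONDS_PER_YEAR = 365 * 24 * 3600


def genome_equivalents(bases, genome_size=HAPLOID_GENOME_BASES):
    """Number of haploid genomes the submitted bases correspond to."""
    return bases / genome_size


def total_bytes(bases, bytes_per_base=BYTES_PER_BASE):
    """Storage footprint in bytes at the given bytes-per-base rate."""
    return bases * bytes_per_base


def download_years(n_bytes, bits_per_second=DOWNLOAD_BITS_PER_SECOND):
    """Years needed to download n_bytes at a constant bit rate."""
    return n_bytes * 8 / bits_per_second / SECONDS_PER_YEAR


if __name__ == "__main__":
    size = total_bytes(BASES_SUBMITTED)
    print(f"{genome_equivalents(BASES_SUBMITTED):.0f} haploid genomes")
    print(f"{size / 1e12:.0f} TB")
    print(f"{download_years(size):.1f} years to download")
</code></pre>
<p>Under those assumptions the sketch gives roughly 1270 genomes, 190 TB, and a bit under 10 years, in line with the estimates above. Change the assumed download rate and the last number moves accordingly.</p>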
]]></content:encoded>
			<wfw:commentRss>http://www.politigenomics.com/2008/12/1000-genome-snps-released.html/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>What&#8217;s in an SRF?</title>
		<link>http://www.politigenomics.com/2008/06/whats-in-an-srf.html</link>
		<comments>http://www.politigenomics.com/2008/06/whats-in-an-srf.html#comments</comments>
		<pubDate>Mon, 30 Jun 2008 21:21:20 +0000</pubDate>
		<dc:creator>dd</dc:creator>
				<category><![CDATA[genomics]]></category>
		<category><![CDATA[1000 Genomes]]></category>
		<category><![CDATA[454]]></category>
		<category><![CDATA[Illumina]]></category>
		<category><![CDATA[informatics]]></category>
		<category><![CDATA[SOLiD]]></category>

		<guid isPermaLink="false">http://www.politigenomics.com/?p=100</guid>
		<description><![CDATA[I have written a bit about the NCBI Short Read Archive (SRA), its internals, and data transfer rates. Here is some information about the data format people are using to submit data from the massively parallel sequencers to the SRA. I apologize in advance for all the acronyms. The SRA is currently accepting 454 data [...]]]></description>
			<content:encoded><![CDATA[<p>I have written a bit about the <a href="http://www.ncbi.nlm.nih.gov/Traces/sra/sra.cgi">NCBI Short Read Archive (SRA)</a>, <a href="http://www.politigenomics.com/2008/05/short-read-archive.html">its internals</a>, and <a href="http://www.politigenomics.com/2008/06/how-fast.html">data transfer rates</a>. Here is some information about the data format people are using to submit data from the massively parallel sequencers to the SRA. I apologize in advance for all the acronyms.</p>
<p>The SRA is currently accepting 454 data in <a href="http://www.454.com/news-events/press-releases.asp?display=detail&#038;id=48">standard flowgram format (SFF)</a> and Solexa data in <a href="http://srf.sourceforge.net/">SRF</a> format.  Soon 454 and AB SOLiD will support the SRF format and submissions will commence in that format for those platforms.  The SFF format contains the flowgrams (intensity per cycle at each spot), base calls, and base quality values.  In other words, the SFF is very similar to the SCF format used for capillary sequencing data (except flowgrams are discrete whereas chromatograms are continuous).  Also, NCBI (as <a href="http://www.politigenomics.com/2008/05/short-read-archive.html">recently discussed</a>) has developed its own storage format for massively parallel sequencing data, which it will also accept as a submission format within the next few months.</p>
<p>So what is an SRF? Well, it is basically just a container format, i.e., what you store in it is up to the implementation.  Thus far, SRF has only been implemented for Illumina/Solexa data; so the rest of this post is specific to that platform and the data types that its implementation of the SRF format contains. The Solexa SRF implementation was done largely by James Bonfield at <a href="http://www.sanger.ac.uk/">Sanger</a> and is distributed as part of the <a href="https://sourceforge.net/project/showfiles.php?group_id=100316&#038;package_id=108243">io_lib</a> package (now distributed separately from the <a href="http://staden.sourceforge.net/">Staden package</a>).  I would imagine that the SOLiD implementation will be very similar to the Solexa implementation.  The 454 implementation will likely be very similar to the SFF already in wide use.</p>
<p>For the <a href="http://www.1000genomes.org/">1000 Genomes</a> <a href="http://www.politigenomics.com/2008/03/1000-genomes.html">pilot projects</a>, the 1000 Genomes Data Collection Center (DCC) is asking that we submit the &#8220;raw&#8221;, &#8220;processed&#8221;, and &#8220;base&#8221; data for each spot.  Raw data are the intensity (int) and noise (nse) values.  Processed data are the processed intensity values (sig2) and four-channel quality values (prb).  Base data are the base calls (the quality value is taken from the prb for the called base).  This results in about 50 bytes per base for the SRF. Compared to 2 bits per base, the minimum possible for DNA&#8217;s four-letter alphabet, this is a 200-fold increase.  So not only do these instruments generate a lot more data, we are storing more information per base now too.  The average submission for a Solexa run is about 100 GB.</p>
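<p>The per-base comparison can be worked through in a few lines of Python. This is only a sketch of the arithmetic: the 50 bytes per base and 100 GB per run are the approximate figures quoted above, and the &#8220;packed&#8221; size is a hypothetical lower bound for bases alone, not something the SRA or io_lib actually produces.</p>
<pre><code>import math

BYTES_PER_BASE_SRF = 50     # approximate SRF footprint per base (from the post)
SUBMISSION_BYTES = 100e9    # roughly 100 GB per Solexa run (from the post)
ALPHABET_SIZE = 4           # A, C, G, T

# Minimum bits needed to distinguish four letters: log2(4) = 2.
min_bits_per_base = math.log2(ALPHABET_SIZE)

# SRF cost per base in bits, and how many times larger it is than the minimum.
srf_bits_per_base = BYTES_PER_BASE_SRF * 8
fold_increase = srf_bits_per_base / min_bits_per_base

# Bases in an average run, and the size of those bases alone at 2 bits each.
# HYPOTHETICAL: a bases-only packing that drops intensities and qualities.
bases_per_run = SUBMISSION_BYTES / BYTES_PER_BASE_SRF
packed_gb = bases_per_run * min_bits_per_base / 8 / 1e9

print(f"{srf_bits_per_base} bits per base in SRF")
print(f"{fold_increase:.0f}-fold more than the 2-bit minimum")
print(f"{bases_per_run:.1e} bases per run, {packed_gb:.1f} GB if packed at 2 bits")
</code></pre>
<p>With these inputs, an average 100 GB submission holds about 2 billion bases, which would fit in about half a gigabyte if only the base calls were kept.</p>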
<p>Why store all this extra information?  Essentially, people do not trust/believe the data at this point.  The quality values provided by these pipelines are not as reliable as those generated for capillary sequence data.  Some people want the raw data so that they can develop and improve base calling/quality algorithms. Clearly you would not need <em>all</em> the 1000 Genomes data to develop such algorithms (although the technology changes at such a rate that you would likely want some rolling subset of the latest runs). Others want the raw data because they think they may want to go back and re-analyze data when better algorithms become available. For a wide variety of reasons (disk space, computational cost, network bandwidth, keeping pace with newly generated data), I doubt any such massive re-analysis will ever take place.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.politigenomics.com/2008/06/whats-in-an-srf.html/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>

