PolITiGenomics

Politics, Information Technology, and Genomics

Medical pioneers

July 1st, 2009

U.S. News & World Report has a new article about pioneers in the field of medical research. Included in the list are The Genome Center’s very own Elaine Mardis and Rick Wilson. The article discusses our work sequencing the first cancer genome and our plans to sequence 150 more by early next year.


HMP funding announced

June 23rd, 2009

The major funding announcements for the Human Microbiome Project were made public today. Washington University School of Medicine in St. Louis really stands out amongst those awarded grants. As the NIH press release indicates, in addition to The Genome Center’s $16.1M large-scale sequencing grant (the largest of the sequencing grants), Ellen Li, MD, PhD, Greg Storch, MD, and Phil Tarr, MD, each received a demonstration project grant of about $1M to study Crohn’s disease, viruses that cause sudden high fevers in children, and necrotizing enterocolitis (a devastating intestinal disease mainly affecting premature infants), respectively. You can find more coverage of the grant announcements at GenomeWeb Daily News and In Sequence and the St. Louis Business Journal.

While HMP pilot projects and “jump start” funding have allowed some of the work to get underway, it really all begins now in earnest. Considering that the human microbiome makes up about 90% of the cells in the human body, i.e., only 10% of the cells in the human body are actually human, unraveling the complex interactions of all these microbes at different body sites is a daunting task. There is already evidence suggesting that the microbiome can affect many aspects of human health, from intestinal disease to gum disease to cancer to obesity. Just as the Human Genome Project paved the way for further medical discoveries, so too will the HMP.


VarScan published

June 23rd, 2009

VarScan, a tool developed at The Genome Center to detect variants in massively parallel sequence data, has been published in Bioinformatics. VarScan can process both 454 and Solexa data from individuals or pools. You can find more information about VarScan in a post by Dan Koboldt, one of the authors of both the paper and VarScan.


UR explained

June 19th, 2009

Tony Brummet from The Genome Center gave a presentation earlier this week at the St. Louis Perl Mongers meeting on UR. The kind folks at StL.pm have posted the videos for the geographically challenged to enjoy.

Part 1
Part 2

You can find a PDF of the slide deck in the Files section of the StL.pm Google Group page.


Link pin, king pin, Ryskin

June 19th, 2009

A bit off topic here, but a former professor of mine seems to be causing a stir in the scientific community. Professor Gregory Ryskin has always been his own man: he spent his time in the Soviet army reading the Feynman Lectures on Physics and was not exactly marching to the department chair’s beat while I was at Northwestern. He also made a few enemies of students with his “no partial credit” policy: answers on the weekly two-question quizzes (upon which your grade was entirely based; there were no tests) were either right or wrong. His reasoning was that in the real world, if you get it wrong and a catastrophe ensues, you don’t get partial credit for the part you got right (certain newsworthy events of the time strengthened his argument). Despite this rigidity, his classes were great, probing the foundations of the material and explaining them in clear, concise terms. Plus, he taught a lot of special topics classes on things like statistical thermodynamics and path integrals that were outside the normal chemical engineering curriculum but very interesting.

The external controversy began after Prof. Ryskin became interested in geology and geophysics a few years back. He subsequently published a paper proposing an alternative theory for mass extinctions, such as that of the dinosaurs.

Now Prof. Ryskin is at it again with a recent paper proposing a new theory that attributes the secular variation in the Earth’s magnetic field to changing ocean currents. Because salt water conducts electricity, any currents induced as it moves through the Earth’s field would in turn affect that field. Prof. Ryskin found that by applying the equations of magnetohydrodynamics to the temporal changes in the ocean flow field, those flow changes could be correlated with the temporal changes in the Earth’s magnetic field. Of course, this does not prove the theory correct, only that the theory is consistent with existing observations. Nonetheless, it is an interesting entry into the scientific debate and a welcome one from a great thinker.
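
For readers wondering what "magnetohydrodynamics equations" look like, the textbook starting point is the induction equation for a conducting fluid. This is the standard form, not necessarily the exact formulation used in the paper; the symbols (B for the magnetic field, u for the fluid velocity, sigma for the electrical conductivity of sea water, mu_0 for the permeability of free space) are my notation, not the paper's.

```latex
\frac{\partial \mathbf{B}}{\partial t}
  = \nabla \times (\mathbf{u} \times \mathbf{B}) + \eta \, \nabla^{2} \mathbf{B},
\qquad
\eta = \frac{1}{\mu_{0} \sigma}
```

The first term on the right describes how the flow field changes the magnetic field, which is why a change in ocean circulation could, in principle, show up as a change in the field; the second term describes how the field diffuses through an imperfect conductor.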


Illumina cluster needs

June 18th, 2009

There is an interesting thread over at the Solexa Google Group about the IT infrastructure needed to support an Illumina Genome Analyzer (GA). The discussion focuses mostly on the cluster and, to a lesser extent, the storage and network required to operate the instrument and generate sequence data (primary analysis). At The Genome Center, we use Platform LSF HPC as our batch scheduler and currently use lsgmake-gap to execute the GAPipeline (the Illumina primary analysis software) in parallel on our cluster. However, GAPipeline is developed and tested by Illumina on a cluster managed by Sun Grid Engine (SGE), which is free/open source software. This situation creates some headaches for us because, as the internals of GAPipeline change, we need to regularly update lsgmake-gap so that GAPipeline will continue to run properly on our cluster. When we migrated to LSF several years ago, we chose it because it was the only batch scheduler that could handle scheduling 50,000+ jobs at a time (a regular occurrence on our cluster). Fortunately, users may no longer have to choose between scalability and ease of use when running GAPipeline as part of their larger computational needs. Chris Dagdigian, who writes the gridengine.info blog, had this to say about the current capabilities of SGE:

  1. SGE 6.2 design goal includes supporting a single array job with 500,000 tasks and hundreds of thousands of concurrent jobs
  2. People have been running hundreds of thousands of SGE jobs per week since the SGE 5.3 days many years ago
  3. I personally know of several sites pushing hundreds of thousands of heavy SGE jobs per week through their systems right now
  4. SGE 6.2 runs a 62,000 core cluster in Texas (RANGER) and has been for some time

“tens of thousands of jobs” is actually pretty easy with Grid Engine and has been for some time, scaling issues encountered in this range have more to do with bad spooling decisions, filesystem design and occasionally an overwhelmed qmaster host. The developers have worked quite a bit this year to improve threading performance, reduce memory footprints and remove things like external RSH methods that consumed system resources like filehandles and TCP ports etc.

This is especially evident in the SGE 6.2 and 6.2u1 release series where speed and scaling were specifically addressed as part of the design effort (6.2u3 and 6.3 will introduce new features). This is the reason why the SGE scheduler is now a thread within the qmaster - one of the more obvious user-visible changes made recently. (emphasis mine - dd)

There are many reasons why one would chose between LSF vs SGE (I have used both for years now) but scaling is not one of the significant selection factors. Features, price, APIs and quality of documentation are far more important along with community adoption/support.

I would guess that breaking out the scheduler into its own thread is a major factor in SGE’s ability to manage so many jobs. This was the major deficiency of SGE and other batch schedulers we tested at the time. Several systems designed their schedulers to automatically run through the list of jobs at certain intervals. With a lot of jobs in the queue, the scheduler would not finish its previous traversal before the new one was scheduled to start. Depending on the implementation, this meant either that the original scheduling pass was killed and the scheduler never processed some jobs, or that scheduler threads kept spawning until the resources on the master node were exhausted (that’s bad).
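
To make that failure mode concrete, here is a toy simulation of a fixed-interval scheduler. It is only a sketch and not how SGE, LSF, or any real scheduler is implemented: the function name, the "kill" and "spawn" policy labels, and the assumption that the master node's throughput is shared evenly among in-flight passes are all mine.

```python
def simulate_scheduler(num_jobs, jobs_per_second, interval_s, total_s, policy):
    """Toy model of a scheduler that starts a new pass over the queue
    every interval_s seconds (all names here are hypothetical).

    policy "kill":  an unfinished pass is killed when the next one is due.
    policy "spawn": the next pass starts alongside any unfinished ones.
    Returns (completed_passes, killed_passes, peak_concurrent_passes).
    """
    if policy not in ("kill", "spawn"):
        raise ValueError("policy must be 'kill' or 'spawn'")

    passes = []  # jobs still to be examined by each in-flight pass
    completed = killed = peak = 0

    for second in range(total_s):
        if second % interval_s == 0:
            if policy == "kill":
                # The previous traversal, if still running, is abandoned.
                killed += len(passes)
                passes = []
            passes.append(float(num_jobs))

        peak = max(peak, len(passes))

        if passes:
            # Assumption: in-flight passes split the master's throughput evenly.
            share = jobs_per_second / len(passes)
            passes = [remaining - share for remaining in passes]
            completed += sum(1 for remaining in passes if remaining <= 0)
            passes = [remaining for remaining in passes if remaining > 0]

    return completed, killed, peak


if __name__ == "__main__":
    # Made-up numbers: 50,000 queued jobs, 1,000 examined per second,
    # a new pass every 15 seconds, two simulated minutes.
    for policy in ("kill", "spawn"):
        print(policy, simulate_scheduler(50_000, 1_000, 15, 120, policy))
```

With a queue that takes longer to traverse than the interval, no pass ever completes under either policy: "kill" keeps restarting from the head of the queue, and "spawn" keeps piling up passes that slow each other down. Shrink the queue to 10,000 jobs and every pass finishes with time to spare.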

(A couple of asides here: since GAPipeline is built on Makefiles, another option that came up in the thread was parallel execution across an LSF cluster using distmake. Also, because of our interest in grid computing, we are currently investigating replacing LSF with Condor.)

Of course, with the rollout of SCS2.4 with RTA (real-time analysis), most of the primary analysis is now done on the instrument control computer, which makes all of this talk about the requirements for producing sequence from the machine much less important. Only one stage of the pipeline, the alignment and reporting (called GERALD), is now run off the instrument computer. The most computationally intensive part of this stage is the alignment (ELAND and its post-processing), and it can only be made parallel on a per-lane basis, i.e., eight ways.
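
As a rough illustration of what "parallel on a per-lane basis" means, the sketch below fans out one task per lane and caps the worker count at eight. It does not call ELAND or GERALD: align_lane is a hypothetical stand-in that does exact-match lookup against a toy reference string, and all function names are mine. In practice each task would launch the real aligner as an external process, which is why a thread pool is enough here.

```python
from concurrent.futures import ThreadPoolExecutor

NUM_LANES = 8  # one task per flow cell lane, so at most eight-way parallelism


def align_lane(lane, reads, reference):
    """Hypothetical stand-in for aligning one lane: exact-match lookup only."""
    hits = []
    for read in reads:
        position = reference.find(read)
        if position >= 0:
            hits.append((read, position))
    return lane, hits


def align_flow_cell(reads_by_lane, reference, max_workers=NUM_LANES):
    """Align every lane concurrently; the lane is the smallest unit of work."""
    workers = max(1, min(max_workers, len(reads_by_lane)))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [
            pool.submit(align_lane, lane, reads, reference)
            for lane, reads in sorted(reads_by_lane.items())
        ]
        return dict(future.result() for future in futures)


if __name__ == "__main__":
    reference = "ACGTACGGTCA"
    reads_by_lane = {1: ["ACGT", "TTTT"], 2: ["GGTC"]}
    print(align_flow_cell(reads_by_lane, reference))
```

However many cores the cluster has, this stage cannot use more than eight of them per run, because a lane's alignment is never split any further.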

Of course, there is also the specter of the requirements for sequence analysis at Illumina GA scale, but that’s a whole other post…


Learning opportunities

June 17th, 2009

These links came to my attention this past weekend and I thought they might be of use to some of the readers here. First, you can access all course materials, even lectures, for the CS61A: Structure and Interpretation of Computer Programs course at UC Berkeley. The course comes highly recommended. Second, Melissa Kahney has aggregated links for a bunch of UNIX and GNU/Linux tutorials grouped by topic and target audience (beginner and expert).


Great Expectations

June 15th, 2009

A colleague of mine at The Genome Center pointed me to this O’Reilly Radar blog post about the talks at OSCON 2009 that Allison Randal, one of the organizers, considers highlights. Very kindly, she mentions my talk, The Freedom to Cure Cancer. I have a rough outline of the talk clanging around in my head. Having it take shape on a slide deck is going to take some work (and a lot of time on Google image search). Hopefully, the talk will live up to the hype.


Bottomed out?

June 11th, 2009

You know how they keep saying the economy has bottomed out and things are starting to turn around? Don’t be too sure. This “recovery” is all on borrowed money. Sooner or later, the piper will come calling. Watch the video below where Peter Schiff explains how the fundamentals of our economy are far from sound (see this post for some background on Mr. Schiff).


Do some good

June 10th, 2009

To begin to wash off all the evil that scientists do, why not do some good by helping a worthy charity like the St. Louis Crisis Nursery? The Crisis Nursery has several events coming up. There are wine tastings on June 18, 2009 in Chesterfield, MO (see details (pdf)) and on June 25, 2009 in St. Charles County. On August 20, 2009, the Nursery will host its annual Celebrity Waiters’ and Waitresses’ Night at Cardwell’s and Canyon Cafe in Plaza Frontenac. Or, you can just donate some of your ill-gotten gains.
