PolITiGenomics

Politics, Information Technology, and Genomics

The cost of doing sequencing


June 23rd, 2010

Whenever you get asked about a recent genome publication or the latest sequencing technology, the conversation invariably turns to cost. It turns out, cost is a tricky thing. When people talk of the “cost” of the Human Genome Project, they typically quote the cost for the entire project. A cost that includes sequencing instruments (several revisions), personnel, overhead, consumables, informatics, and IT. They contrast this rather large cost to the much lower cost of the $10,000 or $1,000 genome. However, in reality that “$10,000 genome” costs more than $10,000 (same goes for the $1,000 genome). You see, when people talk about the $10,000 genome, they are only accounting for the cost of consumables: flow cells and reagents. Perhaps this focus on consumables has its roots in the days of the Human Genome Project when reagent (BigDye®) costs dominated sequencing costs. Perhaps the focus is driven by marketers at the sequencing instrument companies who want to draw attention away from the six-figure sequencing instrument costs. Perhaps this focus is driven by the $10,000 recurring cost number specified by the Archon X PRIZE for Genomics, which receives much more attention than the $1 million direct cost cap. Regardless of the reason for the focus on consumables (likely some combination of all of the above), the reality is that consumable costs have fallen much more rapidly than any other cost associated with genome sequencing and can no longer be the only number quoted when stating the cost of a genome; at least if you want that number to actually mean anything.

So, what other costs should be considered? Well, the types of costs and actual values will depend greatly on your situation. Will you be doing the sequencing or will you be contracting with a core facility or sequencing-as-a-service company? Will you be doing the analysis or relying on a third party? How will you be validating your results? How many people will be working on the project, and at what percentage of their effort? Will you buy everyone a Pet Rock when the project reaches 1 exabase of sequence?

Here I’ll run through a standard cost calculation for a typical academic sequencing and analysis center to sequence and analyze a human genome. The names and costs have been changed to protect the innocent (this means I chose nice, round numbers that are the right order of magnitude). Why not use real numbers? Read the previous paragraph (I’ll wait …): your cost factors and numbers will not be the same as anyone else’s. So you’re going to have to do the calculation for yourself, not just lift the numbers from this post.

First we can consider the consumable costs (e.g., flow cells and reagents). Let’s say those are $10,000. Then there is the instrument depreciation. Let’s say the instrument costs $600,000, has an expected life of three years, and can do 40 runs per year. Assuming straight-line depreciation, the instrument depreciation per run is $5,000 (= $600,000 / (3 × 40)). If the instrument supports two flow cells per run, you would divide that number in half to get $2,500. Now, the DNA doesn’t just hop on the sequencer by itself. DNA has to be acquired, consents have to be signed and approved by institutional review boards (IRBs), and sequencing libraries have to be made. Let’s say sample acquisition costs $100,000 for 50 samples; that’s $2,000 per sample. Shepherding the project and consents through the IRB takes one full-time employee (FTE) at 10% effort for one month. We’ll say the cost of one FTE (salary, benefits, etc.) is $60,000 per year. So getting the project through IRB approval costs $500. If the project is able to use all 50 samples, that’s only $10 per sample! If the consumables and personnel time to make a sequencing library cost $200, then the total production cost for sequencing our human genome is $14,710. Wait, I forgot the IT and LIMS support! In this scenario we’ll say that each instrument needs one IT FTE and one LIMS FTE, each at 25% effort ($750). And you need disk space for the data ($1,000; you can cut that in half if you throw away everything but the sequence, qualities, and alignments) and compute time ($100) to run alignments and QC. Add to that the 50% overhead charge that your institution takes to cover administration, utilities, lab space, etc. (a company would need to determine each of these costs and add them in rather than use this overhead multiplier) and your $10,000 genome costs you nearly $25,000. And you haven’t even called a variant yet.
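
The arithmetic in this paragraph can be sketched in a few lines of Python. Every figure is one of the made-up round numbers from the text, the variable and function names are purely illustrative, and treating overhead as a flat 50% multiplier on the whole subtotal is an assumption about how the "nearly $25,000" was reached.

```python
# Illustrative production-cost calculation using the round numbers from the post.
# All names are hypothetical; substitute your own facility's figures.

FTE_COST_PER_YEAR = 60_000   # salary, benefits, etc.
OVERHEAD_RATE = 0.50         # institutional overhead, applied to everything (assumption)
SAMPLES = 50


def depreciation_per_flow_cell(instrument_cost, life_years, runs_per_year, flow_cells_per_run):
    """Straight-line depreciation spread evenly over every flow cell that is run."""
    return instrument_cost / (life_years * runs_per_year * flow_cells_per_run)


consumables = 10_000
depreciation = depreciation_per_flow_cell(600_000, 3, 40, 2)
sample_acquisition = 100_000 / SAMPLES
# One FTE at 10% effort for one month, shared across all samples.
irb = (FTE_COST_PER_YEAR / 12) * 0.10 / SAMPLES
library_prep = 200

production = consumables + depreciation + sample_acquisition + irb + library_prep

it_and_lims = 750   # figure quoted in the text
disk = 1_000
compute = 100       # alignment and QC

subtotal = production + it_and_lims + disk + compute
total_with_overhead = subtotal * (1 + OVERHEAD_RATE)

print(f"Production cost:   ${production:,.0f}")
print(f"Before overhead:   ${subtotal:,.0f}")
print(f"With 50% overhead: ${total_with_overhead:,.0f}")
```

With the numbers above this gives $14,710, $16,560, and $24,840. The point of writing it down is that swapping in your own figures is a one-line change per cost factor.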

Speaking of variants, let’s assume you want to call SNPs, indels, and structural variations. The first thing you will have to do is align your reads. Let’s say you are efficient and simply use the alignments from the production QC step. Above we assumed $100 for these alignments, but what goes into that number? First you have to determine an average alignment time per genome. Let’s say 90 Gb of sequence (30× coverage of a human genome) in 2×100 base read pairs takes 1,000 core×hr to align to the human reference genome. If you did this on Amazon EC2 ($0.17/core×hr), it would cost you $170 (plus data transfer and storage costs). If you have your own cluster, you need to amortize the cost of your cluster (compute nodes, racks, networking equipment and cabling, PDUs, etc.) per core×hr, add in the cost of your administrators per core×hr, and utilities or overhead per core×hr to get your cost. When you do that calculation, let’s say you get $0.10 per core×hr, so the alignment costs you $100 (but you already paid it above). Merging the BAM files from each lane’s worth of data and marking duplicates takes 50 core×hr, costing $5. Calling SNPs and indels (including reassembly) takes 100 core×hr, costing $10. Detecting structural variation using aberrant read pairs takes 200 core×hr, costing $20. Annotating all the variants across an entire genome takes 100 core×hr, costing $10. The disk space for all of this costs you $1,000 (again, you’ll need to calculate a cost per GB factoring in storage, racks, switches, servers, personnel, etc. to get this number). Finally, somebody needs to run (or automate) this analysis pipeline. Figure that one analyst and one developer, each at 10% effort, can accomplish this over the course of two weeks: $480. Add all this up and your analysis with overhead runs you about $2,300, or about 10% of the cost of generating the data.

Of course, human resequencing for variant detection is not the only application of sequencing data. Other types of analysis, e.g., de novo assembly and metagenomic analysis, can have significantly higher costs per base. For example, in metagenomic analysis you may want to classify reads that do not align to known sequences by aligning them in protein space against a database like NCBI nr. If you generate 10 Gb of sequence per sample and 25% of the read pairs do not align to anything else, you will need to align 12.5 million read pairs (assuming 2×100 base read pairs, as above). If you use the most common tool for this sort of alignment, NCBI BLAST+ blastx, it would take over 5,500 core×hr, costing about $550 by itself.
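
As a rough check on the analysis numbers, here is a sketch that converts core×hr into dollars and adds the storage and personnel figures. The rates and step sizes are the illustrative values from the text. The 50-week working year (which reproduces the $480 personnel figure), the 2×100 base read length for the metagenomic sample, and all function names are assumptions.

```python
# Illustrative analysis-cost calculation; none of these numbers are real.

CLUSTER_RATE = 0.10          # dollars per core-hour on your own cluster
EC2_RATE = 0.17              # dollars per core-hour quoted for Amazon EC2
OVERHEAD_RATE = 0.50
FTE_COST_PER_YEAR = 60_000
WORK_WEEKS_PER_YEAR = 50     # assumption


def compute_cost(core_hours, rate=CLUSTER_RATE):
    """Dollar cost of a job that consumes the given number of core-hours."""
    return core_hours * rate


# Core-hours for each step of the variant-calling pipeline.
analysis_steps = {
    "merge BAMs and mark duplicates": 50,
    "call SNPs and indels": 100,
    "detect structural variation": 200,
    "annotate variants": 100,
}

compute_total = sum(compute_cost(hours) for hours in analysis_steps.values())
disk = 1_000
# One analyst and one developer, each at 10% effort, for two weeks.
personnel = 2 * 0.10 * 2 * FTE_COST_PER_YEAR / WORK_WEEKS_PER_YEAR

analysis_total = (compute_total + disk + personnel) * (1 + OVERHEAD_RATE)


def unaligned_read_pairs(total_bases, read_length, unaligned_fraction):
    """Number of read pairs left over for protein-space alignment."""
    pairs = total_bases / (2 * read_length)
    return pairs * unaligned_fraction


print(f"Alignment on EC2:      ${compute_cost(1_000, EC2_RATE):,.0f}")
print(f"Alignment in-house:    ${compute_cost(1_000):,.0f}")
print(f"Analysis w/ overhead:  ${analysis_total:,.2f}")
print(f"Read pairs for blastx: {unaligned_read_pairs(10e9, 100, 0.25):,.0f}")
print(f"blastx compute cost:   ${compute_cost(5_500):,.0f}")
```

The analysis total comes out at $2,287.50, which is where the "about $2,300" comes from.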

Now that you have your sequence data and list of variants, you are going to need to validate them. There are a lot of different ways to validate variants, e.g., PCR, pool, and sequence or Sequenom, so I am not going to go through a detailed cost calculation. It suffices to say that, depending on the number of variants you want to validate, the cost can rise into the thousands of dollars. Whatever platform you choose, you will need to go through a thorough cost calculation (like the one done above for the original sequencing and analysis). For the sake of this post, which is already too long, we’ll say the validation cost is $2,000.

Finally, somebody has to be running this show. Let’s say project management personnel cost $20,000, or $400 per sample. Put this all together and your $10,000 genome costs about $30,000. In other words, the often-quoted consumables number only accounts for about 50% of the total cost (note: overhead applies to consumables also, so while $10,000 looks like 1/3 of $30,000, it is actually half once the 50% overhead brings it to $15,000). Again, none of the numbers I use above are real (but they are in the ballpark), and all sequencing and analysis facilities are going to have different contributors to their costs, resulting in varying contributions from consumables. However, regardless of the cost contribution of consumables at present, the cost of consumables is projected to fall below $5,000 by the end of this year, and it won’t stop there. As such, it is already meaningless to only quote consumable costs when stating the price of sequencing a genome. By the end of the year, it will be ridiculous.
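
To see where the "about 50%" comes from, the pieces can be added up as below. This is only a sketch with the illustrative numbers from this post. It assumes the validation and project management figures are already fully loaded (no extra overhead); applying the 50% overhead to them as well would push the total slightly above $30,000.

```python
# Illustrative roll-up of the per-genome cost and the share due to consumables.

OVERHEAD_RATE = 0.50
SAMPLES = 50

data_generation = 24_840     # sequencing, IT, storage, and compute, with overhead
analysis = 2_287.50          # variant-calling pipeline, with overhead
validation = 2_000           # placeholder value from the text
project_management = 20_000 / SAMPLES

total = data_generation + analysis + validation + project_management

consumables = 10_000
consumables_with_overhead = consumables * (1 + OVERHEAD_RATE)

naive_share = consumables / total              # what the quoted number suggests
share = consumables_with_overhead / total      # what consumables actually contribute

print(f"Total per genome:         ${total:,.2f}")
print(f"Naive consumables share:  {naive_share:.0%}")
print(f"Loaded consumables share: {share:.0%}")
```

As consumable costs fall, rerunning this with a smaller `consumables` value shows how quickly that share shrinks while everything else stays put.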

Update: Clarified Archon X Prize cost accounting.

BigDye is a registered trademark of Life Technologies.

Posted in genomics, IT | 12 Comments »


12 Responses to “The cost of doing sequencing”

  1. I think the cost of the analysis is way too low. I know institutions where, after many years of work by 2-4 full-time developers, no functional LIMS or pipeline was created.

  2. Mr. Doe (if that is indeed your real name), you are absolutely correct that, in some (perhaps most) sequencing and analysis facilities, the approximate numbers I provide will not be accurate. That is why I encourage everyone to go through this exercise themselves and determine the real cost at their facility. The analysis numbers could easily be double or quadruple the numbers I state. Most facilities don’t have 10% or 25% persons they can use, so if you only have one instrument you will not be able to “split” their efforts across all the data being generated.

  3. Brilliant analysis. Thanks!

  4. Here is my earlier take on this issue, prompted by the announcement of the Archon prize:

    http://www.synthesis.cc/2007/01/a-few-thoughts-on-rapid-genome-sequencing-and-the-archon-prize.html

    Obviously the costs have changed a bit, but the general story is about the same.

    Nice job.

    - Rob Carlson

  5. The Archon Genomic X PRIZE WILL include machine amortization, IT, lab management systems and all personnel costs in its judging of the $10 million purse contest.

    Larry Kedes
    Senior Advisor
    X PRIZE Foundation
    kedes@usc.edu

  6. Dr. Kedes, thanks for the comment. If I read Section 1.5 of the Competition Guidelines for the Archon Genomics X Prize (pdf) correctly, the machine and IT costs (presumably including analysis costs) would fall under the $1 million limit for direct operating costs. Is that a misreading? Are those costs included in the recurring costs?

  7. [...] It’s been obvious for some time that cost will soon be no obstacle to getting your genome sequenced as part of a routine clinical workup. What’s been less clear is just how useful that is going to be, and how physicians should go about incorporating a patient’s genome sequence into routine clinical decisions. (Check out a discussion of where costs are now here.) [...]

  8. Good post on cost analysis. It seems IT is counted twice: once for basic support collecting the data (LIMS and QC analysis), and again for scientific analysis. This would make the 10% total cost estimate a bit higher, but still reasonable.

    However, groups need to be at significantly high production scales to achieve this optimized ratio, as was alluded to earlier in the thread. For small groups who have a hard time getting fractions of people, the IT costs can easily dominate.

  9. Todd, I thought you might have an opinion on this post! IT is counted twice because support is needed for both sequence generation and sequence analysis. They both use storage and compute resources. And yes, fractional people are hard to come by (further underscoring that each entity needs to do this sort of calculation for their own situation).

  10. So, let’s forget about this scenario and go shopping at Complete Genomics’ sales dept, and get your genome and variant list for $7,000-10,000.

  11. Dr. DNA (if that is your real name), that is still not the whole cost: data transfer, storage on your system of the results, making sense of the variants, validation, project management, etc.

  12. Coming in October from Xlibris Corp.
    Genascent: The Human Genome Project in Plain Words
    By C.J. Canna
    This book was written for thoughtful members of the public. Many fields brush up against this science: finance, law, food, international relations, medicine, everything from agriculture through zoology. This book discusses what a logical person might want to know before using or investing in the science.
    The book includes particularly relevant journal articles explained here in common language.
    You’ll be introduced to the ethical and legal questions that were buzzing around the Genome from the start. To engage the reader it includes funny stories collected over thirty years of a research career. To make the cut the stories had to tell something about the science that the reader should know; or show something about the research environment.
    The author has written a trilogy of movie scripts: “Genascent: Footprints in Time”, which tells the story of genetics from Gregor Mendel through 1993; “Genascent II: the Living Code”, covering the start-up through the completion of the sequence; and “Genascent III: So It Is Written”, showing the impact the Genome has had in fighting disease.
    The stories in this book could not be fit into the movies.
