January 12th, 2010
Today Illumina announced their new, high-throughput sequencing instrument, the HiSeq 2000. Sure, the name isn’t that great, but the capabilities, if not game changing, are a significant step forward: the HiSeq 2000 can sequence a tumor/normal pair to 30× coverage in one 8-day run. How does it achieve this five-fold improvement in throughput over current second-generation sequencing technologies?
What it doesn’t do is change the fundamentals of the Illumina sequencing technology. The HiSeq 2000 uses Sequencing By Synthesis (SBS), just like the Genome Analyzer (GA). In fact, it actually dials down the current SBS state of the art, using lower cluster densities (350,000 – 400,000 clusters/mm²) and shorter read lengths (2×100) than the latest GA IIx release (600,000 clusters/mm² and 2×125). (Current tiles are 0.5293 mm², so 600,000 clusters/mm² equate to about 318,000 clusters/tile.)
The throughput improvement comes from two major factors: increased data collection area and rate. The HiSeq 2000 has two 8-lane flow cells, as compared to the single flow cell on the GA, and images both the top and bottom surfaces of the flow cell. In addition, the imaging area of the HiSeq 2000 flow cell is larger than the GA flow cell’s. This all adds up to a more than five-fold increase in the surface area available for data collection on the HiSeq 2000.
As you know if you operate a GA, the imaging part of each cycle takes up more time than the chemistry portion. Thus, to run two flow cells on the same instrument, Illumina needed to speed up data acquisition until it was at least as fast as the chemistry stage, so that one flow cell could be doing chemistry while the other was imaging (like the SOLiD platform from Life Technologies). To do this, they used their experience with systems like iScan and its Time Delay and Integration (TDI) line imaging technology, and completely replaced the optics system.
The GA performs area imaging to collect its image data. The flow cell is moved, the camera focuses, and four images (tiles) are taken (one for each base). The flow cell is then moved again and the process repeated. For the current GA IIx, each of the eight lanes is imaged at 120 positions (in a 2×60 grid), resulting in 480 images per lane per cycle. The HiSeq 2000 scans a 2048-pixel-wide swath down one side of a lane and then comes back and scans the swath on the other side of the lane. This is then repeated for the other surface in the lane and then across all the lanes. Because of this continuous data collection, there are four cameras in the system rather than one. This line scanning system is able to collect data at a rate of 50 MB/s, as compared to about 8 MB/s in the GA IIx.
When you put all of this together, the HiSeq 2000 is able to generate about 200 Gb of sequence from over 1 billion clusters in the form of 2×100 base reads from two flow cells in about eight days with error rates (1-2%) comparable to current GA IIx data (as one would expect since both use SBS). Illumina actually already has data from “production” instruments on several human genomes.
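For anyone who wants to check the arithmetic behind these figures, here is a minimal sketch. It uses only the numbers quoted above; the constant and function names are mine, and "over 1 billion clusters" is treated as exactly 1 billion, so the yield is a rough floor rather than an Illumina specification.

```python
# Back-of-the-envelope check of the figures quoted in the post.
# All names are illustrative; the numbers come from the text above.

TILE_AREA_MM2 = 0.5293           # current GA IIx tile area
GA_DENSITY_PER_MM2 = 600_000     # GA IIx cluster density
GA_POSITIONS_PER_LANE = 120      # 2 x 60 grid of imaging positions
IMAGES_PER_POSITION = 4          # one image per base

HISEQ_CLUSTERS = 1_000_000_000   # "over 1 billion", used as a floor (assumption)
READS_PER_CLUSTER = 2            # paired-end
READ_LENGTH = 100                # bases per read
RUN_DAYS = 8


def clusters_per_tile(density_per_mm2: float, tile_area_mm2: float) -> float:
    """Clusters on one tile at a given density."""
    return density_per_mm2 * tile_area_mm2


def images_per_lane_per_cycle(positions: int, images_per_position: int) -> int:
    """Images the GA takes for one lane in one cycle."""
    return positions * images_per_position


def run_yield_gb(clusters: int, reads_per_cluster: int, read_length: int) -> float:
    """Total bases per run, in gigabases (1 Gb = 1e9 bases)."""
    return clusters * reads_per_cluster * read_length / 1e9


yield_gb = run_yield_gb(HISEQ_CLUSTERS, READS_PER_CLUSTER, READ_LENGTH)
print("clusters/tile:", round(clusters_per_tile(GA_DENSITY_PER_MM2, TILE_AREA_MM2)))
print("GA images/lane/cycle:", images_per_lane_per_cycle(GA_POSITIONS_PER_LANE, IMAGES_PER_POSITION))
print("HiSeq Gb/run:", yield_gb, "Gb/day:", yield_gb / RUN_DAYS)
```

The per-day figure this prints is simply the run yield divided by the eight-day run time, which is where a number like 25 Gb/day comes from.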
Because of the five-fold increase in sequence data generation rate (25 Gb/day versus 5 Gb/day for the GA IIx), Illumina needed to rethink how it processed and stored all the data. Normal hard drives cannot write four 625 MB images every 30 seconds. As such, images are not written to disk by default; they are processed in memory by the instrument control software (as opposed to the GA, where images are written to disk and processed by RTA, which also does the base calling). You can save images if you want, but you will need 32 TB of disk space per run and it will slow down your run. As with the most recent version of RTA for the GA IIx, you can save thumbnail images (without penalty) to aid in troubleshooting (the thumbnails, of course, cannot be used for off-instrument analysis).
Because of the need to incorporate phasing and pre-phasing information when base calling, the RTA for HiSeq lags a few cycles behind the current data acquisition cycle. The result is that base calling does not actually complete until about two hours after the run completes. In other words, the processing of data is not real time, but it is synchronous. In fact, if the data analysis falls behind, the instrument is paused in a safe state until it catches up. This is guaranteed to occur at least once in each run: after around five cycles the instrument will pause for about two hours while template generation (cluster identification) is performed.
The large data rates also forced Illumina to rethink how they store and transfer data off the instrument. Gone are the QSEQ files; they are replaced by BCL files, which are binary, per image, per cycle files that contain the base call and quality information. Because they are per image, per cycle files, they can be transferred cycle by cycle as they are generated (as opposed to QSEQ files, which are read based). The BCL files are also more compact, requiring only 1 byte/base (B/b) as compared to QSEQ files, which require about 2.5 B/b. In addition, the intensity files are not transferred by default, so RTA output goes from 10 B/b to just 1 B/b. Thus, even though you are generating five times more sequence data than a GA, your RTA directory will actually be smaller (about 250 GB).
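The storage claims are easy to sanity-check. The sketch below is my own rough arithmetic, not anything from Illumina: it assumes a GA IIx run yields about 40 Gb (one fifth of the HiSeq's 200 Gb, per the "five times more" statement) and uses decimal units throughout. The gap between the 200 GB it computes and the roughly 250 GB quoted above is presumably other files in the RTA directory, which is a guess on my part.

```python
# Rough storage arithmetic from the bytes-per-base figures in the post.
# Names and the 40 Gb GA run yield are assumptions for illustration.

HISEQ_YIELD_GB = 200.0        # gigabases per HiSeq 2000 run
GA_YIELD_GB = 40.0            # assumed: one fifth of the HiSeq yield

BCL_BYTES_PER_BASE = 1.0      # HiSeq RTA output (BCL only, no intensities)
QSEQ_BYTES_PER_BASE = 2.5     # QSEQ files alone
GA_RTA_BYTES_PER_BASE = 10.0  # entire GA RTA output, including intensities


def output_size_gb(yield_gigabases: float, bytes_per_base: float) -> float:
    """Gigabases times bytes/base gives gigabytes (decimal GB)."""
    return yield_gigabases * bytes_per_base


def image_write_rate_mb_s(images: int, mb_per_image: float, seconds: float) -> float:
    """Sustained write rate needed to save raw images to disk."""
    return images * mb_per_image / seconds


hiseq_rta_gb = output_size_gb(HISEQ_YIELD_GB, BCL_BYTES_PER_BASE)
ga_rta_gb = output_size_gb(GA_YIELD_GB, GA_RTA_BYTES_PER_BASE)
raw_image_rate = image_write_rate_mb_s(4, 625.0, 30.0)

print(f"HiSeq base calls: {hiseq_rta_gb:.0f} GB per run")
print(f"GA RTA output:    {ga_rta_gb:.0f} GB per run")
print(f"Raw image writes: {raw_image_rate:.1f} MB/s sustained")
```

Under these assumptions the HiSeq base-call output comes out smaller than a full GA RTA directory even with five times the sequence, which is the point the paragraph above is making.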
The HiSeq 2000 has a completely new instrument software user interface, which allows the operator to input data via a keyboard and mouse or a touch screen. Run configuration and setup are done via a wizard-driven workflow. The setup and running of each flow cell is completely independent. This allows you to start the runs at different times, use a different number of cycles for each flow cell, and even do an indexing run on one flow cell and a standard paired-end run on the other. The cycles of the two flow cells will still need to synchronize so that one is doing chemistry while the other is doing data acquisition. The current version of the instrument control software has no LIMS integration capabilities. Since this instrument is clearly targeting large genome centers, that is unfortunate.
The instrument software also has greatly enhanced real-time metric reporting as compared to the GA. In addition to the RTA reports (e.g., cluster density, intensity, focus, and quality scores), the standard reports typically generated after a GA run by GERALD (e.g., the Summary report) are generated cycle by cycle by RTA and made available to the operator via the instrument control software and remotely as HTML pages (there is also discussion of a smartphone application). Phi X can be spiked into lanes to allow the software to generate error rate numbers (and Error and Perfect plots) on the fly as well. All in all, the reports are very similar to those people have become familiar with using the GA; they are just generated dynamically during the run. This will allow operators to more carefully observe their runs and take corrective action if something goes awry.
All of the extra data processing and reporting requires additional computational horsepower. Don’t worry though, no iPAR is necessary. The HiSeq instrument computer is just beefier than its GA counterpart: two quad-core 64-bit processors, 48 GiB of RAM, and a 64-bit Microsoft Windows Vista operating system. For downstream analysis, Illumina will still offer their IlluminaCompute (turn-key sequence data analysis cluster) but is also strongly pushing cloud-based analysis solutions (specifically Amazon AWS). Illumina has altered GERALD so ELANDv2 can run using more than one process per lane. Alignment of 200 Gb of data using ELANDv2 takes about 30 hours using 64 cores.
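If you are sizing a cluster (or a cloud bill) around that alignment figure, it helps to convert it to core-hours. This is a minimal sketch with names of my own choosing; the wall-clock estimate for other core counts assumes perfectly linear scaling, which real alignment jobs will not quite achieve.

```python
# Convert the quoted ELANDv2 figure (200 Gb in ~30 hours on 64 cores)
# into core-hours. Linear scaling with core count is an assumption.

ALIGN_WALL_HOURS = 30.0
ALIGN_CORES = 64
RUN_YIELD_GB = 200.0


def core_hours(wall_hours: float, cores: int) -> float:
    """Total compute consumed by a job."""
    return wall_hours * cores


def wall_hours_on(cores: int, total_core_hours: float) -> float:
    """Idealized wall-clock time on a different core count."""
    return total_core_hours / cores


total = core_hours(ALIGN_WALL_HOURS, ALIGN_CORES)
print(f"core-hours per run: {total:.0f}")
print(f"Gb aligned per core-hour: {RUN_YIELD_GB / total:.3f}")
print(f"idealized hours on 128 cores: {wall_hours_on(128, total):.1f}")
```

Since a run takes about eight days, any core count that finishes alignment in well under that time keeps analysis from becoming the bottleneck.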
The good and the bad of this instrument is that it is really just more of the same. Illumina has taken the optics from iScan and combined that with the fluidics and chemistry of the GA. This means the system is more likely to “work” at launch than those of us dealing with new sequencing platforms are used to. It also means the data will be familiar (just more of it) and therefore will suffer from the same limitations (increasing errors with read length, short insert sizes). Shrinking from the bleeding edge of the GA in terms of cluster density and read length means the HiSeq likely has significant headroom to increase well beyond 200 Gb/run. A quick back-of-the-envelope calculation pushing the HiSeq to 600,000 clusters/mm² and 2×150 read lengths results in 450 Gb/run. (Again, that is my rough calculation and not any sort of promise from Illumina.) So, while it may be more of the same, it is likely that it will be a lot more of the same. The ability to sequence a tumor and normal genome from an individual in a single instrument run in about a week is really going to change the calculation (and economics) for cancer sequencing going forward.
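The headroom estimate can be reproduced with a simple linear scaling. The sketch below assumes that yield scales proportionally with both cluster density and read length, and that the 200 Gb baseline corresponds to the upper end of the quoted range (400,000 clusters/mm²) at 2×100; that baseline is the one that reproduces 450 Gb. The function name is hypothetical.

```python
# Linear scaling of run yield with cluster density and read length.
# Baseline density of 400,000 clusters/mm^2 is an assumption (upper end
# of the 350,000-400,000 range quoted earlier in the post).

BASE_YIELD_GB = 200.0
BASE_DENSITY = 400_000
BASE_READ_LENGTH = 100


def projected_yield_gb(new_density: float, new_read_length: float,
                       base_yield_gb: float = BASE_YIELD_GB,
                       base_density: float = BASE_DENSITY,
                       base_read_length: float = BASE_READ_LENGTH) -> float:
    """Scale the baseline yield by the density and read-length ratios."""
    density_ratio = new_density / base_density
    length_ratio = new_read_length / base_read_length
    return base_yield_gb * density_ratio * length_ratio


# Pushing to GA IIx density and 2x150 reads.
print(projected_yield_gb(600_000, 150))
# Same projection if the baseline were the low end of the range instead.
print(round(projected_yield_gb(600_000, 150, base_density=350_000), 1))
```

Starting from the low end of the density range gives a somewhat larger projection, so the result depends noticeably on which baseline you pick.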
Update: The above text has been corrected to state that QSEQ files are about 2.5 B/b. It is the entire RTA output that is 10 B/b.
Update2: I’ve added some links.