I recently received this request.

We've taken on 454 sequencing and would like information about the amount and type of bioinformatics support needed to help investigators with their data. Our goal is to outfit our bioinformatics analysis core with talent sufficient to align, assemble and annotate pathogen genomes and large sections of genomes, as well as targeted sequencing following up on association studies.

In truth, I have received a lot of requests like this recently. To streamline the process of answering them, I am going to post my answers here. Hopefully readers will find them helpful and the people who asked the questions will not find it offensive.

The informatics needs of 454 sequencers are currently quite modest by next-generation sequencer (nextgen) standards. A 454 run takes about 8 hours and generates about 100 Mb (megabases) of sequence data with an average read length of about 225 bases. Each 8-hour run generates about 13 GB of raw data (images). The primary analysis (image feature extraction and base calling) generates another 15 GB of data. Of the approximately 28 GB of total data generated by a run and its primary analysis, about 600 MB is the SFF files that all subsequent analysis will need (the SFF files will likely soon be replaced by SRF files once that standard is stable). The SFF files contain the flowgrams, base calls, and quality values (they are analogous to the SCF files for capillary-based sequencing) and can be compressed down to about 250 MB. In addition to the SFF files, the primary analysis generates about 10 kB of run metrics that may be useful for quality control, troubleshooting, etc.

You can use this information about the amount of data generated to inform your data retention policy. The decision essentially comes down to a cost-benefit analysis: how much are you willing to spend to avoid the hassle of having to re-run samples, re-analyze data, restore from tape, etc.? Once you decide on a data retention policy, e.g., keep everything for 30 days and SFF files for one year, you can use the details of the policy, your projected run schedule, and the data size numbers to determine how much storage you will need to support the data produced by a single sequencer. For example, if you run the instrument twice daily and keep all data for 30 days, you will need about 2 TB of disk space just to operate the instrument,

  2 runs/day * 28 GB/run * 30 days retention * 1 TB/1000 GB = 1.68 TB

and about 200 GB per year for all the SFF files generated, assuming they are stored compressed (2 runs/day * 250 MB/run * 365 days is roughly 180 GB). Of course, that is just to produce the data (see below).
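The arithmetic above is easy to redo for a different run schedule or retention policy. The following is a minimal sketch in Python, not anything 454 ships: the function and constant names are made up for illustration, and the per-run sizes are simply the figures quoted in this post. With two runs a day and 30-day retention, the working storage comes out at 1.68 TB, which is where "about 2 TB" comes from; the yearly SFF figure is closest to 200 GB when the compressed size (250 MB per run) is used.

```python
# Per-run data sizes as quoted in the post (all in GB, decimal units).
# The names below are illustrative, not part of any 454 software.
RAW_GB_PER_RUN = 13.0              # raw images
PRIMARY_GB_PER_RUN = 15.0          # output of primary analysis
SFF_GB_PER_RUN = 0.6               # uncompressed SFF files
SFF_COMPRESSED_GB_PER_RUN = 0.25   # compressed SFF files


def working_storage_tb(runs_per_day, retention_days,
                       gb_per_run=RAW_GB_PER_RUN + PRIMARY_GB_PER_RUN):
    """Disk needed to keep every file from every run for the retention window."""
    return runs_per_day * gb_per_run * retention_days / 1000.0


def sff_archive_gb(runs_per_day, days=365,
                   gb_per_run=SFF_COMPRESSED_GB_PER_RUN):
    """Disk needed to keep only the SFF files for the given number of days."""
    return runs_per_day * gb_per_run * days


if __name__ == "__main__":
    print(f"working storage: {working_storage_tb(2, 30):.2f} TB")
    print(f"SFF archive, compressed: {sff_archive_gb(2):.1f} GB/year")
    print("SFF archive, uncompressed: "
          f"{sff_archive_gb(2, gb_per_run=SFF_GB_PER_RUN):.1f} GB/year")
```

Note that storing the SFF files uncompressed would more than double the yearly archive, so it is worth deciding up front which form you keep.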

As for computing power, the primary analysis can run on the computer attached to the sequencer, although you then have to wait for the analysis to complete (around four hours) before starting another run. If you do not want to wait, you can set up another modestly powerful computer (2 cores and 2 GiB of RAM) to perform the primary analysis off instrument. Note that 454 is planning upgrades to their sequencers this year that will likely increase these numbers (raw bases, read length, and storage).
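To see what the wait for primary analysis costs in throughput, here is a rough back-of-the-envelope sketch in Python. It assumes runs are started back to back with no setup, wash, or idle time, which is my simplification and not something 454 specifies; the function name is likewise made up for illustration.

```python
RUN_HOURS = 8.0                 # length of one sequencing run, from the post
PRIMARY_ANALYSIS_HOURS = 4.0    # approximate primary analysis time, from the post


def max_runs_per_day(run_hours, blocking_analysis_hours=0.0):
    """Upper bound on runs per day when each run must also wait for any
    analysis that blocks the instrument. Ignores setup and idle time."""
    return 24.0 / (run_hours + blocking_analysis_hours)


# Analysis on the instrument's own computer blocks the next run.
on_instrument = max_runs_per_day(RUN_HOURS, PRIMARY_ANALYSIS_HOURS)

# Analysis on a separate machine does not block the instrument.
off_instrument = max_runs_per_day(RUN_HOURS)

print(f"on-instrument analysis:  at most {on_instrument:.1f} runs/day")
print(f"off-instrument analysis: at most {off_instrument:.1f} runs/day")
```

Under these assumptions, analyzing on the instrument caps you at two runs a day, while moving the analysis to a second computer raises the ceiling to three.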

For data analysis beyond just getting sequences, 454 provides tools for aligning their reads back to reference genomes and for doing de novo assembly; these work directly on the SFF files and are pretty straightforward to use. Currently their assembler works on genomes up to about 50 Mb, but they are actively working to improve its performance on larger genomes. There are also third-party tools, e.g., ssahaSNP, for aligning and assembling 454 data, finding variants in it, etc. All of these tools are under active development, so I would encourage anyone using 454 to try to remain software agnostic, periodically trying all the tools out there and seeing which ones do the best job for your type of analysis. This means building a processing pipeline that is flexible enough to support regularly plugging in and unplugging different tools. One perhaps non-obvious result of this approach is that you will need a little more compute power and disk storage than if you had a stable, predictable analysis pipeline. How much disk space will you need? Unfortunately, it really depends on the type of analysis you want to do. How much computational power will you need? Again, it depends on the types of analysis you want to do and how many permutations you want to perform.
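One way to keep a pipeline "pluggable" in this sense is to register several interchangeable tools per stage and select one by name. The following is a minimal sketch in Python of that idea only. The `Pipeline` class, the stage names, and the three stand-in steps are all hypothetical; a real step would invoke the actual program (for example through `subprocess`) rather than just rename files, and none of this reflects how the 454 tools or ssahaSNP are actually driven.

```python
from typing import Callable, Dict, List

# A step takes a list of input file paths and returns a list of output paths.
Step = Callable[[List[str]], List[str]]


class Pipeline:
    """Ordered stages, each with several registered tools and one active tool."""

    def __init__(self, stages: List[str]) -> None:
        self.stages = stages
        self.tools: Dict[str, Dict[str, Step]] = {s: {} for s in stages}
        self.active: Dict[str, str] = {}

    def register(self, stage: str, name: str, step: Step) -> None:
        self.tools[stage][name] = step
        # The first tool registered for a stage becomes its default.
        self.active.setdefault(stage, name)

    def use(self, stage: str, name: str) -> None:
        if name not in self.tools[stage]:
            raise KeyError(f"no tool {name!r} registered for stage {stage!r}")
        self.active[stage] = name

    def run(self, inputs: List[str]) -> List[str]:
        data = inputs
        for stage in self.stages:
            data = self.tools[stage][self.active[stage]](data)
        return data


# Stand-in steps (hypothetical): they only rename paths so the sketch runs.
def vendor_mapper(paths: List[str]) -> List[str]:
    return [p.replace(".sff", ".vendor.aln") for p in paths]


def third_party_mapper(paths: List[str]) -> List[str]:
    return [p.replace(".sff", ".thirdparty.aln") for p in paths]


def variant_caller(paths: List[str]) -> List[str]:
    return [p + ".variants" for p in paths]


pipeline = Pipeline(["align", "call_variants"])
pipeline.register("align", "vendor", vendor_mapper)
pipeline.register("align", "third_party", third_party_mapper)
pipeline.register("call_variants", "default", variant_caller)
```

Swapping the aligner is then a single call, e.g. `pipeline.use("align", "third_party")`, and the rest of the pipeline is untouched. Running the same input through each registered tool in turn is also where the extra compute and disk mentioned above goes.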

Beyond these 454-directed tools, 454 read lengths are nearing capillary read lengths. Once the reads get into the 400-500 base pair range, the world of FASTA-based tools tuned for longer reads becomes usable. One caveat to the blind application of such tools to 454 data is that the error model in 454 reads is much different from the error model in capillary-based sequencing technologies. Many of these traditional tools, whether knowingly or unknowingly, have assumptions about the error model hard-coded into their logic. The converse is also true, i.e., 454-specific tools do a better job with 454 data than with non-454 data because they have knowledge of the 454 error model built into them. For example, the 454 mapping tool, in general, does a better job aligning 454 reads to a reference than more generic tools, e.g., BLAST and BLAT. Unfortunately, the error model in 454 data is currently not fully captured by the single-quality-score-per-base paradigm widely used by sequence analysis software. To be fair, the single quality score per base also has its shortcomings when applied to capillary-based sequencing.

The level of bioinformatics support you need for a 454 machine depends largely on what types of analysis you want to do. The type of people you need depends on how you want to do your analysis. If you just want to provide your investigators with alignments and assemblies of pathogen-size genomes, then the tools 454 provides will likely suffice, and you will need only minimal support to run these tools and forward the results in a format your investigators can handle. Once you have the alignments/assemblies, the normal tools you use to annotate genomic data will likely be applicable. To estimate the support needed (both bioinformatics and IT), just be sure to factor in the much higher data rate at which the 454 machine operates compared to capillary-based sequencers. If you want to develop novel alignment, assembly, annotation, etc. techniques based on 454 data, then you will need a different type of personnel, likely pairing bioinformaticians with software engineers who are able to translate the bioinformatic algorithms into time- and memory-efficient software that can keep up with the flow of data from the 454 sequencer.