I am returning from a meeting on ultra-high throughput sequence (UHTS) data exchange. I know what you are thinking: "That sounds like an exciting two days!" UHTS (also called next-gen sequencing or massively parallel sequencing) has greatly expanded the application of genome sequencing across a wide spectrum of biological research. What used to be reserved for large sequencing centers and large international projects is now within the purview of individual researchers. The economics of sequencing have reached the point where, although it is still more costly than lower-resolution techniques like expression arrays, the additional information it provides makes it compelling even at the higher cost. Thus, experiments that would traditionally have been carried out on these array platforms are now considered as potential sequencing investigations.

Perhaps an example is in order to make these abstractions more concrete. Currently, expression arrays are used to measure relative levels of gene expression in tissues. More specifically, the expression (transcription) of genes is often measured in two different tissues, e.g., liver and pancreas or tumor and normal, to determine which genes are being expressed at different levels in each tissue. These expression arrays contain tags of DNA complementary to specific genes, or more specifically to exons in genes. The mRNA in the tissue is collected and converted into cDNA, and the cDNA is hybridized onto the array. The degree of hybridization of each probe is measured, and you obtain the relative expression level of each exon (gene) represented on the array. In the case of tumor versus normal tissue, what you do not get is any indication as to what mutation might have caused the expression to increase or decrease in one tissue compared to the other. To get that information, you would have to selectively sequence that portion of the genome, usually a few hundred to a few thousand bases per probe. The cost of this more complete approach can be high because low-resolution probes and/or lots of discrepant expression necessitate a large amount of sequencing, and this sequencing is still primarily done with more expensive, capillary-based methods. What UHTS provides is a way to shortcut this approach. You can directly sequence the cDNA, thereby obtaining both the sequence and the relative expression levels. The latter are obtained by aligning each sequence, or "read", generated against all genes. The genes that have more reads aligned to them are expressed more than those that have fewer reads aligned to them.
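
For the programmers in the audience, the read-counting idea can be sketched in a few lines of Python. This is only a toy illustration, not how any real pipeline does it: it assumes the alignment step has already happened and that each read has been assigned to exactly one gene. The gene names, the read counts, the reads-per-million scaling, and the pseudocount are all assumptions I made up for the example.

```python
from collections import Counter
import math


def expression_levels(aligned_genes):
    """Turn a list of gene names (one per aligned read) into reads per million."""
    counts = Counter(aligned_genes)
    total = sum(counts.values())
    # Scale by the total so samples with different read depths are comparable.
    return {gene: n * 1_000_000 / total for gene, n in counts.items()}


def log2_fold_change(sample_a, sample_b, pseudocount=1.0):
    """Compare two samples gene by gene; positive means higher in sample_a."""
    genes = set(sample_a) | set(sample_b)
    return {
        gene: math.log2(
            (sample_a.get(gene, 0.0) + pseudocount)
            / (sample_b.get(gene, 0.0) + pseudocount)
        )
        for gene in genes
    }


# Hypothetical alignment results: the gene each read aligned to.
normal_reads = ["GENE_A"] * 2 + ["GENE_B"] * 6 + ["GENE_C"] * 2
tumor_reads = ["GENE_A"] * 6 + ["GENE_B"] * 2 + ["GENE_C"] * 2

normal = expression_levels(normal_reads)
tumor = expression_levels(tumor_reads)
changes = log2_fold_change(tumor, normal)

for gene in sorted(changes):
    print(f"{gene}: log2 fold change (tumor vs. normal) = {changes[gene]:+.2f}")
```

In this made-up data, GENE_A collects more reads in the tumor sample and GENE_B fewer, so they come out with positive and negative fold changes respectively, while GENE_C is unchanged. A real analysis would also have to deal with gene length, reads that align to more than one gene, and statistical significance, none of which this sketch attempts.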

So, as the example indicates and as I said above, experiments that have traditionally been carried out on array platforms can now be done on UHTS platforms. However, the two platforms have grown up in different cultures with different standards and rules. This meeting was sponsored by the array folks (MGED) and the sequencing folks (Genomic Standards Consortium) to get these two groups together to either merge their current standards or come up with new standards that both groups agree on.

When the meeting started, the different cultures were apparent. The array folks have largely focused on metadata standards, i.e., data about the sample and the experiment, whereas the sequencing folks have not worried much about metadata. This difference is expected because the number and variety of array experiments are large, whereas sequencing has heretofore been focused on relatively few, larger, long-term projects. More recent sequencing efforts like The Cancer Genome Atlas have necessarily had to deal with more metadata, e.g., patient sample information, and the sequencing community has largely relied on ad hoc standards that have differed from project to project. So this meeting was timely for both the array folks and the sequencing folks. The array folks would like to have standards similar to those they have enjoyed for microarray data, e.g., MAGE-TAB and MIAME, as they transition to UHTS. And the sequencing folks would like some standards to develop around metadata exchange as the number of projects they participate in and the amount of metadata they report on increase.

Other aspects of UHTS experiments were also discussed: the different types of experiments being done on UHTS, the short-read format, vendor-specific data formats (Applied Biosystems, Illumina, and Helicos were represented), the short-read archives of NCBI and EBI, and the need for standards in general. But these topics were largely informational, and the group more or less organically decided to focus on the metadata standards, of which several flavors were presented. I won't get into the details, but suffice it to say there are people who like XML (computer scientists), people who like tab-delimited (biologists), and people who will end up having to support both (me). The meeting was a positive first step but, as with all meetings of this sort, it will not be known for some time whether it was a success. Time and effort will determine whether a standard is developed and adopted by everyone involved.