June 13th, 2008
If you are at all familiar with the massively parallel sequencing platforms, you know that the raw data, the images, look a lot like star field pictures. Once you process all these images, you get DNA sequences, called “reads”. For these technologies, the reads are typically short, 30 to 50 bases, which presents difficulties when trying to align these reads back to a reference genome. For one, the shorter the sequence, the less unique it is, i.e., the more places in the reference genome it can match. However, the data is not the only problem. The reference genome is a problem too. While each of us has two different versions of each autosomal chromosome (i.e., you have a diploid genome), when you download the human reference genome, you get a single base at each position (i.e., a haploid reference) (there are a few exceptions where an alternative haplotype is provided as a separate sequence). The reference is distributed in FASTA format which can support the full IUPAC DNA codes and therefore would allow one to specify multiple nucleotides at a given position, e.g., R means G or A. Unfortunately, most aligners do not support the full set of codes by default (one exception is Mosaik).
Not supporting the full set of nucleotide codes is only part of the problem of representing variation in a reference genome. Insertions and deletions (indels) and more complex structural variations cannot be represented in a single FASTA reference. As is the case for the alternative haplotypes in the human reference sequence, you can have multiple haplotypes in a reference but they are represented as multiple, distinct sequences which often have portions of their sequences that are identical (especially at the beginning and end). This presents a difficulty for short-read aligners which are trying to uniquely align a 30 b read. If the same 30 b sequence appears in multiple sequences in the reference, the aligner has no way of knowing whether the identical reference sequences are truly different or just fragments of alternative haploypes. Fortunately, there are other formats and representations that can accommodate indels, rearrangements, and repeats without artificially introducing repetitive sequence. One such format is de Bruijn graphs. Several assemblers for short-read sequences already use de Bruijn graphs to represent their assemblies, e.g., EULER and Velvet. One could imagine that the overlap engine from these assemblers could be converted into an algorithm to align sequence to a reference represented as a de Bruijn graph. While de Bruijn graphs may be overkill for such a task, there are undoubtedly efficient ways to represent multiple, complex haplotypes and align reads against them.
Another challenge for aligners is aligning cDNA sequence to a transcriptome reference. Captured mRNA used to create a cDNA library can be in a variety of states, e.g., before and after splicing, and represent various splice isoforms. Representing all possible processing states and splice isoforms in a FASTA reference is extremely difficult. When you try to do it, you end up with a situation similar to the alternative haplotypes mentioned above: identical sequences representing the same parts of the reference confuse the aligner. Fortunately, the same solution to this problem discussed above can be applied here too.
Why go to all the effort of developing an aligner that could align to a complex, n-ploid reference? If such an aligner were available, the detection of novel variants would be enhanced because you would not be penalized when aligning against known SNPs or indels and therefore would be better able to detect unknown variations in close proximity to known ones. You would also be better able to interpret alignment results because by reducing ambiguity in your reference, you would reduce ambiguity in your results.