DNA sequences are composed of four nucleotides (Adenine = A, Cytosine = C, Thymine = T, Guanine = G). A common representation of a DNA sequence is a simple string, which is fine for short sequences but becomes memory-intensive for complete genomes. For instance, the human genome consists of about 3.2 billion base pairs, which equates to roughly 3GB if each nucleotide is stored as a byte. However, genomes contain various types of
repeated subsequences, and repetitive DNA sequences comprise approximately 50% of the human genome.
While there are many ways to take advantage of the redundant nature of genomic sequences, here I want to play with a simple implementation of a DNA sequence class in Scala that aims to store sequence data more efficiently by performing a run-length encoding. Apart from more complex repeats of sub-sequences, DNA sequences also contains many stretches of repeated nucleotide symbols. Here a fragment of
Escherichia coli:
TGGAAAGCAATGCCAGGCAGGGGCAGGTGGCCACCGTCCTCTCTGCCCCCG...
Instead of storing the sequence as a string with repeated characters a run-length encoding would store each character and the number of times it appears. This way a sequence such as GGCAGGGG could be represented as (G,2)(C,1)(A,1)(G,4). Since 4 nucleotides can be represented by 2 bits and most stretches of repeated characters are short (a 6 bit number should do), one could pack a run into one byte. But here I want to keep things super simple and represent a run by a case class
case class Run(symbol:Char, location:Int)
where symbol is the nucleotide symbol and location is the position of the first character of the run. Why storing the location of the nucleotide and not the number of repeats as suggested above? The reason is that we want to have random access on the encoded sequence. We need to be able to efficiently retrieve a symbol at a specific location in the genome. A run-length encoding would require a linear search of the runs to find the run that contains the location. Here we store the location directly and can infer the length of the run from the difference in the locations of two subsequent runs. To give an example: the sequence
symbols: GGCAGGGG
locations: 01234567
would be encoded as Run(G,0)Run(C,2)Run(A,3)Run(G,4). Before we look at the Scala code for the encoding let us add one refinement. You will notice that for the last run we actually don't know the run length, since there is no subsequent run with start position we could subtract from. To avoid this problem we add a terminator symbol '$' to the end of the sequence
symbols: GGCAGGGG$
locations: 012345678
and the encoding becomes Run(G,0)Run(C,2)Run(A,3)Run(G,4)Run($,8). Here the implementation of the function to encode a DNA sequence:
import scala.collection.immutable.Seq
def encode(dna: Iterable[Char]) = {
var runs = Seq(Run(dna.head, 0))
for ((s, p) <- (dna ++ Iterable('$')).zipWithIndex if runs.last.symbol != s)
runs :+= (Run(s, p))
runs
}
We will store a collection of runs as an immutable sequence and therefore import immutable.Seq. The encode function takes an Iterable over characters, which allows us to read a large genome from file without actually having to read it completely into memory first. The collection of runs is initialized with the first DNA symbol (dna.head) at location zero. Biologists actually use locations starting from 1 but we avoid this additional complexity here.
Before looping over the DNA symbols we add the sequence terminator '$' to the iterator (dna ++ Iterable('$')) and then zip with the symbol locations. In each iteration we test whether the last symbol in the runs collections is different from the current symbol (runs.last.symbol != s) and if that is the case, add a new run to the collection (runs :+= (Run(s, p)).
Using a for-loop to construct the sequence of runs is not really functional style. We could use fold to achieve the same but the code becomes essentially unreadable:
def encode(dna: Iterable[Char]) = {
(dna ++ Iterable('$')).zipWithIndex.foldLeft((End, 0, Seq(Run(seqdnahead, 0))))((o, e) =>
if (o._1 == e._1) o else (e._1, e._2, o._3 :+ Run(e._1, e._2)))._3
}
While this can be decomposed to make it a bit more readable, it remains a challenge to the eye:
def encode(dna: Iterable[Char]) = {
val runs = Seq(Run(dna.head, 0))
def acc(o: (Char, Int, Seq[Run]), e: (Char, Int)) = (o,e) match {
case ((os, ol, or), (s,l)) => if (os == s) o else (s, l, or :+ Run(s, l))
}
(dna ++ Iterable('$')).zipWithIndex.foldLeft((End, 0, runs))(acc)._3
}
Let us stick with the most readable version and implement a class to create encoded DNA sequences:
import scala.collection.immutable.Seq
class DNA(protected val runs: Seq[Run])
object DNA {
private val End = '$'
def apply(dna: Iterable[Char]) = {
var runs = Seq(Run(dna.head, 0))
for ((s, p) <- (dna ++ Iterable(End)).zipWithIndex if runs.last.symbol != s)
runs :+= (Run(s, p))
new DNA(runs)
}
}
Now we can create a DNA sequence by calling DNA('GGCAGGGG'). In the next step we make the class an Iterable[Char], which will come in handy later. We need to implement two methods: length and iterator:
class DNA(protected val runs: Seq[Run]) extends Iterable[Char] {
def length: Int = runs.last.location
override def iterator =
for (i <- (0 to runs.length-2).iterator;
_ <- runs(i).location until runs(i+1).location) yield runs(i).symbol
}
As you can see, length is easy. We simply take the location of the last run, which is equivalent to the length of the decoded sequence. The iterator is a bit more tricky. We have an outer loop over all runs except the last one and convert to an iterator ((0 to runs.length-2).iterator) because we need to return an iterator. The inner loop returns the repeated symbols and their number is inferred from the location difference of the current and the next run (runs(i).location until runs(i+1).location).
With an iterator over the sequence symbols at hand, implementing the toString method becomes trivial:
class DNA(protected val runs: Seq[Run]) extends Iterable[Char] {
...
override def toString() = iterator.mkString
}
We can also very easily concatenate DNA sequences
class DNA(protected val runs: Seq[Run]) extends Iterable[Char] {
...
def ++(seq: DNA) = new DNA(runs ++ seq.runs)
}
Note that concatenation returns a new DNA sequence and not a string. This is already very nice and we can now create, concatenate and print sequences
print(DNA('ACTG') ++ DNA('TTGAGG'))
but one fundamental method is still missing. We want to be able retrieve a symbol at an arbitrary location of the DNA sequence. We could iterate over the runs, find the run that contains the requested location and print out its symbol
class DNA(protected val runs: Seq[Run]) extends Iterable[Char] {
...
def apply(location: Int): Char = {
for (i <- (1 to runs.length-1))
if(runs(i).location > location)
return runs(i-1).symbol
DNA.End
}
}
but it would be of O(n) complexity. However, since the collection of runs is sorted in increasing order of location we can employ binary search and improve to order O(log n). First we implement a method runIndex that returns the index of the run that contains the location (or the last run) and with that the implementation of apply is simple
private def runIndex(location: Int) = {
def bsearch(lo: Int, hi: Int): Int = {
if (lo > hi) return hi
val mid = lo + (hi - lo) / 2
runs(mid) match {
case Run(_, l) if (l == location) => mid
case Run(_, l) if (l < location) => bsearch(mid + 1, hi)
case Run(_, l) if (l > location) => bsearch(lo, mid - 1)
}
}
bsearch(0, runs.length - 1)
}
def apply(location: Int): Char = runs(runIndex(location)).symbol
Also beautifully simple is now the implementation a method to retrieve a slice of the DNA sequence
override def slice(start: Int, end: Int) = DNA(start until end map apply)
However, the method is rather inefficient since it performs a binary search for each element of the slice, creates a string and the converts the string to runs again. It would be more efficient to find the first and the last run of the slice, take the runs between and create a new DNA sequence from that. Here we go:
override def slice(start: Int, end: Int) = {
var runsSlice = Seq[Run]();
val (first, last) = (runIndex(start), runIndex(end))
runsSlice :+= Run(runs(first).symbol, 0)
for (i <- first+1 to last; run = runs(i))
runsSlice :+= Run(run.symbol, run.location-start)
runsSlice :+= Run(DNA.End, end-start+1)
new DNA(runsSlice);
}
Note that we have to correct the first and last run in case that the start or end position of the slice does not match the run location exactly. Finally let us glue it all together to make it a complete piece of code:
import scala.collection.immutable.Seq
case class Run(symbol:Char, location:Int)
class DNA(protected val runs: Seq[Run]) extends Iterable[Char] {
private def runIndex(location: Int) = {
def bsearch(lo: Int, hi: Int): Int = {
if (lo > hi) return hi
val mid = lo + (hi - lo) / 2
runs(mid) match {
case Run(_, l) if (l == location) => mid
case Run(_, l) if (l < location) => bsearch(mid + 1, hi)
case Run(_, l) if (l > location) => bsearch(lo, mid - 1)
}
}
bsearch(0, runs.length - 1)
}
def apply(location: Int): Char = runs(runIndex(location)).symbol
def length: Int = runs.last.location
override def iterator =
for (i <- (0 to runs.length-2).iterator;
_ <- runs(i).location until runs(i+1).location) yield runs(i).symbol
def ++(seq: DNA) = new DNA(runs ++ seq.runs)
override def slice(start: Int, end: Int) = {
var runsSlice = Seq[Run]();
val (first, last) = (runIndex(start), runIndex(end))
runsSlice :+= Run(runs(first).symbol, 0)
for (i <- first+1 to last; run = runs(i))
runsSlice :+= Run(run.symbol, run.location-start)
runsSlice :+= Run(DNA.End, end-start+1)
new DNA(runsSlice);
}
override def toString() = iterator.mkString
}
object DNA {
private val End = '$'
def apply(dna: Iterable[Char]) = {
var runs = Seq(Run(dna.head, 0))
for ((s, p) <- (dna ++ Iterable(End)).zipWithIndex if runs.last.symbol != s)
runs :+= (Run(s, p))
new DNA(runs)
}
}
Is the run-length encoding implemented here actually saving any memory? In most cases probably not! This is a simplistic, toy implementation and the fraction of longer repetitive stretches of DNA symbols is not that large. It is possible to create longer stretches of repeated nucleotides using the Burrows–Wheeler transform that can be computed in linear time. And should you find the binary search too slow you could try an interpolation search.
Finally, there are more advanced, more efficient, but also more complex compression methods for DNA sequences available (see here).