Monday, December 10, 2012

Newick tree format parser in Scala

Parser combinators in Scala are nice for small, non-time-critical parsers. While typically slower than parser generators such as ANTLR or hand-written recursive descent parser (see also here), they are easier to implement and do not require external syntax or output files. If speed is an issue there is also Parboiled, a parser generator very similar in style to the parser combinators in Scala but apparently faster. I haven't tried it, however.

I needed a parser to read phylogentic tree data and Scala's parser combinators seemed just right. Without going into any details concerning phylogenetic trees, the following figure shows an example tree with a root node R, two internal nodes X and Y, and the leaf nodes A,B,C,D. Each branch has a length annotation, e.g. 0.1.

There are various formats to describe phylogentic trees but to keep it simple let's start with a format defined by the following EBNF:

  tree    ::= identifier length [subtree]
  subtree ::= "(" tree {"," tree} ")"
  length  ::= ":" floatingPointNumber 

Here a tree has a node identifier followed by the branch length (distance to parent), followed by an optional subtree. A subtree is a sequence of trees enclosed in brackets and separated by commas. The branch length is given by a colon followed by a floating point number.

Assuming branch lengths of 0.1 for all nodes (except the root node R), the tree above would be written as

R:0.0 (X:0.1 (A:0.1, B:0.1), Y:0.1 (A:0.1, B:0.1))

The given EBNF can easily be translated into a parser combinator that can read this tree description:

import scala.util.parsing.combinator._

class SimpleParser extends JavaTokenParsers {
  def tree                = ident ~ length ~ opt(subtree)
  def subtree:Parser[Any] = "(" ~> repsep(tree, ",") <~ ")"
  def length              = ":" ~> floatingPointNumber
}

Note that [subtree] translates to opt(subtree) and tree {"," tree} to repsep(tree, ","). Furthermore, subtree requires a return type, since it is a recursive function. I am lazy here and say Parser[Any], since we are going to improve on this code anyway. Let's add a method read() to read a tree description and return the parsing result:

class SimpleParser extends JavaTokenParsers {
  def tree = ident ~ length ~ opt(subtree)
  def subtree:Parser[Any] = "(" ~> repsep(tree, ",") <~ ")"
  def length = ":" ~> floatingPointNumber

  def read(text:String) = parseAll(tree, text)
}

val parser = new SimpleParser()
println( parser.read("R:0.0(X:0.1(A:0.1,B:0.1),Y:0.1(C:0.1,D:0.1))") )

The output is a bit convoluted but the structure reflects the tree structure. Note that opt() returns an Option, which is the reason for the Some and None in the output:

parsed: ((R~0.0)~Some(List(((X~0.1)~Some(List(((A~0.1)~None), ((B~0.1)~None)))), ((Y~0.1)~Some(List(((C~0.1)~None), ((D~0.1)~None)))))))

So far so good, but the current parser output is not very useful. What we want to get back from the parser is a proper data structure of nodes and their children. In the next steps we are going to refactor and extend the current code a bit. We start with a Node class that has a display() method to print a tree:

case class Node(name:String, length:Double, descendants:List[Node]) {
  def display(level:Int=0) {
    printf("%s%s:%.2f\n","  "*level,name,length)
    descendants.foreach(_.display(level+1))
  }
}

To keep the grammar separated from the rest and also to allow comments we factor out the read() method into an abstract super class that takes the node type as type parameter T:

abstract class TreeParser[T] extends JavaTokenParsers {
  val comment = """\\[.+?\\]"""

  def tree:Parser[T]

  def read(file:File):T = read(Source.fromFile(file).getLines().mkString("\n"))

  def read(text:String):T = parseAll(tree, text.replaceAll(comment,"")) match {
    case Success(result, next) => result
    case failure => throw new Exception(failure.toString)
  }
}

Comments are everything between two rectangular brackets (defined by a non-greedy, regular expression) and removed from the input text before running the parser. We also overload the read() method to read a tree from a file and tree is an abstract method to be implement by a subclass. Note that the read() method returns a type T (a node), which will be the root node of the tree.

parseAll() returns a ParseResult with sub classes Success, Failure or Error. We are only interested in Success and match against it to retrieve the parsing result (the root node of type T). In all other cases an exception is thrown, showing the parsing error.

Now we can rewrite the SimpleParser class as follows:

class SimpleParser[T](nf: (String,Double,List[T]) => T) extends TreeParser[T] {
  def tree = (ident ~ length ~ opt(subtree)) ^^ {case n~l~d => nf(n,l,d.getOrElse(Nil))}
  def subtree:Parser[List[T]] = "(" ~> repsep(tree, ",") <~ ")"
  def length = ":" ~> floatingPointNumber ^^ { _.toDouble }
}

There are two main difference to the previous implementation. First, a node factory (nf) is provided that takes a node name (String), the branch length (Double) and a list of child nodes (List[T]) to create a node of type T. This allows us to use the same parser to create trees of different node types, which is especially handy for testing. Note that the function subtree() now has a tight return type Parser[List[T]] and is not Parser[Any] anymore.

The second change is that actions have been added to the parsers. { _.toDouble } returns a Double for the parsed floating point number of the branch length and {case n~l~d => nf(n,l,d.getOrElse(Nil))} calls the node factory to create a node. If there is no subtree the list of descendants is Nil (d.getOrElse(Nil)).

If we run the parser now, we get back a tree structure of nodes to play with. I am a bit sneaky with the node factory and simply provide the apply() method of the node singleton that is automatically generated for the case class.

val parser = new SimpleParser(Node.apply)
val root = parser.read("R:0.0(X:0.1(A:0.1,B:0.1),Y:0.1(C:0.1,D:0.1))")
root.display()
R:0.00
  X:0.10
    A:0.10
    B:0.10
  Y:0.10
    C:0.10
    D:0.10

Now we are ready for the real deal: A parser for phylogenetic trees in Newick format. I took the following grammar from here and adapted it bit.

  tree        ::= subtree ";"
  descendants ::= "(" subtree {subtree ","} ")"
  subtree     ::= descendants name length | leaf
  leaf        ::= name length
  name        ::= [quoted | unquoted]
  unquoted    ::= identifier
  quoted      ::= "'" { not "'" | "''"} "'"
  length      ::= [":" floatingPointNumber]

From that we can derive the corresponding parser:

class NewickParser[T](nf: (String,Double,List[T]) => T) extends TreeParser[T] {
  def tree = subtree <~ ";"
  def descendants:Parser[List[T]] = "(" ~> repsep(subtree, ",") <~ ")"
  def subtree = descendants~name~length ^^ {case t~n~l => nf(n,l,t)} | leaf
  def leaf = name~length ^^ {case n~l => nf(n,l,Nil)}
  def name = opt(quoted | unquoted) ^^ { _.getOrElse("") }
  def unquoted = ident
  def quoted = """'([^']|'')*'""".r  ^^ { _.drop(1).dropRight(1).replace("''","'") }
  def length = opt(":" ~> floatingPointNumber) ^^ { _.getOrElse("0").toDouble }
}

The Newick format allows to leave out node names or branch length information, which can lead to rather wicked tree definitions such as

"(A,(B,C)E)F;"
(((,),));
(,B)C:0.3;

With the lengthy introduction in mind the NewickParser class is hopefully not beyond comprehension. The only difficult and admittedly ugly part is the action { _.drop(1).dropRight(1).replace("''","'") }, which simply removes the enclosing quotes from a quoted identifier and replaces pairs of single quotes ('') by single quotes. The quoted parser could be implemented more nicely as:

def quoted = "'" ~> """([^']|'')*""".r <~ "'"  ^^ { _.replace("''","'") }

but in this case leading and trailing white spaces of quoted identifiers are lost, which might be acceptable. The astute reader will notice two other limitation. Firstly, nested comments, e.g. [ a comment [ another comment]] are not permitted and identifiers cannot contain brackets. I have not found a really nice way of supporting this functionality. Either one extends the grammar to allow comments where ever white spaces can appear, which makes the grammar unwieldy. Or one overrides the white space parser by a custom parser that supports comments. Or the last option that I chose is to implement a separate comment parser to remove comments from the input text instead of the regular expression employed in TreeParser.

class CommentParser extends JavaTokenParsers {
  override val skipWhitespace = false

  /** removes comments from given text */
  def remove(text:String) = parseAll(rest, text)

  def rest = rep(quoted | comment | any) ^^ { _.mkString }
  def any = """.""".r
  def quoted = """'([^']|'')*'""".r
  def comment: Parser[String] = ("["~rep(not("]")~(comment | any))~"]") ^^ {_ => ""}
}

I found the increase in complexity not worth the additional functionality and stayed with the limited parser that does not allow nested comments.

Finally, some links I came across recently that I found useful for parser developement:
Code examples for combinator parsing
Ten grammars for simple integer arithmetic expressions

Wednesday, October 3, 2012

Counting nucleotides

Rosalind is a nice web site I recently discovered. Like 99 scala problems, Topcoder or similar sites it poses little programming challenges but with a special focus on bioinformatics.

The first and simplest programming challenges is to count the nucleotides within a DNA sequence (see here). For instance, given a sequence "AGCTTTTCATTCTGACTGC", what are the frequencies of the four nucleotides A,C,T,G?

It is a trivial problem, I used to solve with a dictionary but reading the problem description, which requires an ordered output of the frequencies, I realized that there is an even more elegant solution.

An efficient but ugly solution would be to count the individual nucleotide occurrences within a loop, e.g.

s = "AGCTTTTCATTCTGACTGC"
A,C,T,G = 0,0,0,0
for e in s:
    if e=='A': A +=1
    if e=='C': C +=1
    if e=='G': T +=1
    if e=='T': G +=1            

print A,C,T,G

The repetition of the four ifs is not nice and could be replaced as follows

s = "AGCTTTTCATTCTGACTGC"
idx = {'A':0, 'C':1, 'G':2, 'T':3}
counts = [0]*4
for e in s:
    counts[idx[e]] += 1           
print counts 

However, it remains a rather ugly solution and I find the following line much more elegant

print map(s.count, "ACTG") 

The good news is, it is even pretty efficient. The sequence gets parsed 4 times and the complexity is therefore O(4n) = O(n), which is not worse than the complexity of the dictionary approach:

counts = dict()
for e in s:
    counts[e] = counts.get(e,0)+1
print counts  
print map(s.count, "ACTG") 

Just for fun and comparison the same in Scala. First the dictionary approach

def counts = s.foldLeft(Map[Char,Int]() withDefaultValue 0){
  (m,c) => m + ((c,m(c)+1))
}
println(counts)

which is rather convoluted. Less efficient but much more readable is the version that uses "groupBy":

val s = "AGCTTTTCATTCTGACTGC"
def counts = s.groupBy(identity).mapValues(_.size)
println(counts)

and mapping the count function onto the nucleotides again leads to the most elegant code

val s = "AGCTTTTCATTCTGACTGC"
val counts = "ACTG".map(ch => s.count(_==ch))
println(counts)

A slightly different version I saw on Rosalind

val s = "AGCTTTTCATTCTGACTGC"
println( "ACGT" map (s count _.==) mkString " " )

Monday, September 24, 2012

Permutations in Python and Scala

A permutations of a collection of objects describes all possible, ordered arrangements of those objects. For instance, all permutations of (1,2,3) are (1,2,3), (1,3,2), (2,1,3), (2,3,1), (3,1,2), and (3,2,1). The following Python code generates all permutation of the elements in the given list:
def permutations(ls):
    if len(ls) == 1: yield ls
    for i in xrange(len(ls)):
        for p in permutations(ls[:i]+ls[i+1:]):
            yield [ls[i]]+p
Here a usage example:
ls = [1,2,3]            
for p in permutations(ls):
    print p
[1, 2, 3]
[1, 3, 2]
[2, 1, 3]
[2, 3, 1]
[3, 1, 2]
[3, 2, 1]
Now, let's compute permutations in Scala (2.9):
def permutations[T](ls: List[T]): List[List[T]] = ls match {
  case List() => List(List())
  case _ => for(e <- ls; r <- permutations(ls filterNot(_==e))) yield e::r
}
In contrast to the Python version the Scala version above is not a generator and creates all permutations in memory. While the "yield" keyword in Python defines a generator in Scala it is a way to describe a list comprehension. To create a generator version Iterators can be used:
def permutations[T](ls: List[T]): Iterator[List[T]] = ls match {
  case List() => Iterator(List())
  case _ => for(e <- ls.iterator; r <- permutations(ls.filterNot(_==e))) yield e::r
}
And a more compact but slightly less readable version:
def perms[T](ls:List[T]): Iterator[List[T]] =
    if(ls.isEmpty) Iterator(sl) else for (h <- ls.iterator; t <- perms(ls.filterNot(_==h))) yield h::t
The time complexity is n factorial (O(n!)), which is easy to see. Let's say there is one element, then there is only one permutations. If there are two elements, the there is two possible permutations. With three elements let's hold one, that there is two left, which gives us two permutations times three. Obviously the sequence is 1*2*3*...*n, which is n!. BTW, this can be beautifully written in Scala:
def fac(n:Int) = (1 to n) reduceLeft (_ * _)
or more cryptically-compact using left fold instead of reduce:
def fac(n:Int) = (1/:(2 to n))(_*_)

Sunday, September 23, 2012

Sync Keepass on Android and PC

Synchronize Keepass on Android and PC

Keepass is a great application to manage passwords. Even better there is the corresponding application keepassdroid for the Android platform. With Google drive there is now an easy way to keep the password database synchronized across several PCs and phones.
Simply put the database file (.kdb or .kdbx) on the Google Drive and use it as the default database in Keepass. On the phone it is beneficial to make the database available offline (by pinning) it. Otherwise, without a connection or a data plan the passwords would not be available.
It is tricky to find the location of the Google Drive. Simple tap the database file and the Keepassdroid opens, shows the file path and allows you to make it the default database.

Wednesday, September 19, 2012

Fun with cartesian products

Cartesian product is a fancy term for all possible, ordered combinations of things taken from 2 or more collections of things (There is an exact mathematical definition for it but I don't bother you with it.) The Cartesian product is frequently used to define a grid of coordinates. For instance the following code (Python 2.7) prints out the Cartesian product of the two lists xs and ys:
xs = [1,2]
ys = [1,2,3]
for x in xs:
    for y in ys:
        print x,y
And here is the output:
1 1
1 2
1 3
2 1
2 2
2 3
Simple enough. Just two nested loops for two dimensions. This can be written very nicely as a list comprehension
product = [(x,y) for x in xs for y in ys]
which gives you
[(1, 1), (1, 2), (1, 3), (2, 1), (2, 2), (2, 3)]
However, it gets more challenging if you want to generalize to arbitrary numbers of dimensions. An iterative implementation is rather ugly but the recursive version is beautiful (though hard to remember):
def product(ls):
    if not ls: return [[]]
    return [[e]+p for e in ls[0] for p in product(ls[1:])]
And here a usage example:
ls = [[1,2], ['a','b','c'], ['A','B']]
for p in product(ls): 
    print p
[1, 'a', 'A']
[1, 'a', 'B']
...
[2, 'c', 'A']
[2, 'c', 'B']
How does it work? If the input list ls is empty a list of empty lists is returned ([[]]). Otherwise a list comprehension is performed, where the elements ([e]+p) are the elements (e) of first list (ls[0]) within the input list, prepended to the cartesian products (p) of the remainder (ls[1:]) of the input list. While really nice, it has the disadvantage that the entire Cartesian product is created in memory. Frequently, only the individual products are needed and it would be much more efficient, if one could iterate over them. The following implementation show the generator version:
def product(ls):
    if not ls: return iter( [[]] ) 
    return ([e]+p for e in ls[0] for p in product(ls[1:]))
Often only a more constraint version of this very general implementation is need. Here, for instance, a binary counter with n digits:
def bin_counter(n):
    if not n: return iter( [[]] ) 
    return ([e]+p for e in [0,1] for p in bin_counter(n-1)
or a bit more general, a counter with arbitrary digit ranges:
def counter(ls):
    if not ls: return iter( [[]] ) 
    return ([e]+p for e in xrange(ls[0]) for p in counter(ls[1:]))
ls = [2,3]                
for p in counter(ls): print p
[0, 0]
[0, 1]
[0, 2]
[1, 0]
[1, 1]
[1, 2]
Finally, let's do the same in Scala (2.9). First the general version of the Cartesian Product:
def product[T](ls: List[List[T]]): List[List[T]] = ls match {
  case Nil => List(List())  
  case h::t => for(e <- h; r <- product(t)) yield e::r
}  
  
val ls = List(List(1,2),List(1,2,3))
println( product(ls) )
The type declarations make it a bit more wordy but type-safe and it still looks elegant. The main difference to the Python version is, that the dimensions need to be of the same type (e.g. all Ints or Char but not mixed). The counter implementation is very similar. Apart from the different type (List[Int]) the main difference is the usage of List.range(0,h). Alternatively, e <-(0 to h).toList, could have been used but the required conversion to a list would make it ugly.
def counter(ls: List[Int]): List[List[Int]] = ls match {
  case Nil => List(List())  
  case h::t => for(e <- List.range(0,h); r <- counter(t)) yield e::r
}

val ls = List(2,3)
println( counter(ls) )
None of the two Scala implementations above is a generator though. Don't get deceived by the "yield " keyword, which defines a list comprehension in Scala but not a generator as in Python. However, creating a version that generates products or counts on demand using an iterator is easy; here for the counter:
def counter(ls: List[Int]): Iterator[List[Int]] = ls match {
  case Nil => Iterator(List())  
  case h::t => for(e <- Iterator.range(0,h); r <- counter(t)) yield e::r
}

val ls = List(2,3)
counter(ls).foreach(println)

That's it.