Sunday, May 26, 2013

Reading FASTA files with Scala

FASTA is a simple file format to store nucleotide or peptide sequences. A sequence in FASTA format starts with a greater-than sign followed by the sequence name (and possibly other information) and then several lines of nucleoide or amino acid letters. A multi-FASTA file contains multiple FASTA sequences. Here an example taken from wikipedia:

>SEQUENCE_1
MTEITAAMVKELRESTGAGMMDCKNALSETNGDFDKAVQLLREKGLGKAAKKADRLAAEG
LVSVKVSDDFTIAAMRPSYLSYEDLDMTFVENEYKALVAELEKENEERRRLKDPNKPEHK
IPQFASRKQLSDAILKEAEEKIKEELKAQGKPEKIWDNIIPGKMNSFIADNSQLDSKLTL
MGQFYVMDDKKTVEQVIAEKEKEFGGKIKIVEFICFEVGEGLEKKTEDFAAEVAAQL
>SEQUENCE_2
SATVSEINSETDFVAKNDQFIALTKDTTAHIQSNSLQSVEELHSSTINGVKFEEYLKSQI
ATIGENLVVRRFATLKAGANGVVNGYIHTNGRVGVVIAAACDSAEVASKSRDLLRQICMH

It is a common task to read FASTA sequences from a file and I was aiming for an elegant/short implementation in Scala. Here is a recursive version I like:

  
  def readFasta(filename: String) = {
     import scala.io.Source
     def parse(lines: Iterator[String]): List[(String,String)] = {
       if(lines.isEmpty) return List()
       val name = lines.next.drop(1)
       val (seq,rest) = lines.span(_(0)!='>')
       (name, seq.mkString)::parse(rest)
     }
     parse(Source.fromFile(filename).getLines())
  }

It defines an internal function parse that takes an iterator over the lines in the file (generated by Source.fromFile(filename).getLines() and returns a list of tuples, where each tuple contains the sequence name and the string of nucleotide or amino acid letters.

The recursion stops when the iterator is empty (lines.isEmpty) otherwise a line is read, which must be the name and the first letter is dropped to get rid of the greater-than sign (lines.next.drop(1)). After the line with the name span(_(0)!='>') is used to take the following lines until the next sequence name is found and store them in seq. _(0)!='>' tests whether the first letter of the line is the greater sign. The remaining lines are stored in rest.

Actually, "storing" is the wrong word, since Scala beautifully returns two new iterators when span is called on an iterator. It is worth noting, that nowhere a list of the read lines is stored. It is turtles/iterators all the way through except for the last line ((name, seq.mkString)::parse(rest)), where the return list is created. (name, seq.mkString) builds a tuple with the sequence name and the the sequence letters - the latter is constructed by gluing all sequence lines together with mkString. This tuple is the head of the return list and the tail is build recursively by calling parse for the remaining lines.

Let's add three minor improvements. It will make the code less elegant but more robust. Firstly, lines could have leading or trailing white spaces. They shouldn't but hey, you know how it is. To play it save we add a to the part where the lines are read from the file (last line in code). Secondly, there could be empty lines that we filter out via filterNot(_.isEmpty()).

Thirdly, the line with the sequence name frequently contains additional annotation that is typically separated by the pipe "|" symbol. But there is really no solid standard and it could be in any format - and it will. We therefore define a regular expression to extract the name from the header line and tweak it if necessary. The expression """>(.+?)(\\|.+)?""".r has two groups. The first captures the name the second captures any annotation starting with the pipe "|" symbol, which needs to be escaped (because it functions as OR in Regex language). The final code looks like this:

  
  def readFasta(filename:String) = { 
     import scala.io.Source
     val header = """>(.+)(\\|.+)?""".r
     def parse(lines: Iterator[String]): List[(String,String)] = {
       if(lines.isEmpty) return List()
       val header(name,annotation) = lines.next
       val (seq,rest) = lines.span(_(0)!='>')
       (name, seq.mkString)::parse(rest)
     }
     parse(Source.fromFile(filename).getLines().map(_.trim).filterNot(_.isEmpty()))
  }    

While elegant the above solution has the problem of being recursive (specifically not even tail recursive), which will cause a stack overflow for files with large numbers of sequences. The following iterative implementation is ugly but I haven't found any really nice alternative. Take it as a temporary implementation ;)

  
  def readFasta(filename:String) = {
    val header = """>(.+)(\\|.+)?""".r
    var lines = Source.fromFile(filename).getLines().filterNot(_.isEmpty())
    var sequences = List[(String,String)]()
    while(lines.hasNext) {
      val header(name,annotation) = lines.next
      val (seq,rest) = lines.span(_(0)!='>')
      sequences = (name, seq.mkString)::sequences
      lines = rest
    }
    sequences.reverse
  }  

Also interesting is this solution that uses parser combinators to read FASTA sequences.

Alexandre Masselot has posted an interesting problem (see below). What if you want an iterator over the sequences instead of a list? We can easily modify the code above to achieve this by creating and returning an iterator when calling readFasta. An iterator needs a hasNext method, which essentially replaces the condition of the while loop used above and a next method, which contains the body of the while loop:

  
  def readFasta(filename: String) = 
    new Iterator[(String, Iterator[Char])] {
    val header = """>(.+)(\|.+)?""".r
    var lines = Source.fromFile(filename).getLines().filterNot(_.isEmpty())
    def hasNext = lines.hasNext
    def next = {
      val header(name, annotation) = lines.next
      val (seq, rest) = lines.span(_(0) != '>')
      lines = rest
      (name, seq.flatMap(_.iterator))
    }
  }

If you look at the code closely, you will notice on more change. I have replaced (name, seq.mkString) by (name, seq.flatMap(_.iterator)). This is where Scala shines. Instead of concatenating the lines to a sequence via mkString we can simply call flatMap(_.iterator) to convert lines to iterators over line characters and then concatenate the iterators via flatMap.

In practice, you will find an iterator over sequences frequently useful, but an iterator over sequence characters less so. Many computational evaluations of sequence data will require random access. If memory consumption is an issue, compressing the sequence is probably a better way. Also in many cases a set of sequence data will be related and storing only the differences to a reference sequence (e.g. the human reference genome) can be a very efficient representation of large sequence data.