Monday, May 22, 2017

Reading line-structured files (FASTA) with Python

Many files contain data in a line-structured format, where a line prefix indicates a record or fields in a record. An example are FASTA files, where ';' or '>' indicate a header or comment lines, followed by multiple lines with sequence data:

  
;LCBO - Prolactin precursor
; a sample sequence in FASTA format
MDSKGSSQKGSRLLLLLVVSNLLLCQGVVSTPVCPNGPGNCQVSLRDLFDRAVMVSHYIHDLSS
EMFNEFDKRYAQGKGFITMALNSCHTSSLPTPEDKEQAQQTHHEVLMSLILGLLRSWNDPLYHL
VTEVRGMKGAPDAILSRAIEIEEENKRLLEGMEMIFGQVIPGAKETEPYPVWSGLPSLQTKDED
ARYSAFYNLLHCLRRDSSKIDTYLKLLNCRIIYNNNC*

>MCHU - Calmodulin 
ADQLTEEQIAEFKEAFSLFDKDGDGTITTKELGTVMRSLGQNPTEAELQDMINEVDADGNGTID
FPEFLTMMARKMKDTDSEEEIREAFRVFDKDGNGYISAAELRHVMTNLGEKLTDEEVDEMIREA
DIDGDGQVNYEEFVQMMTAK*

>gi|5524211|gb|AAD44166.1
LCLYTHIGRNIYYGSYLYSETWNTGIMLLLITMATAFMGYVLPWGQMSFWGATVITNLFSAIPYIGTNLV
EWIWGGFSVDKATLNRFFAFHFILPFTMVALAGVHLTFLHETGSNNPLGLTSDSDKIPFHPYYTIKDFLG
LLILILLLLLLALLSPDMLGDPDNHMPADPLNTPLHIKPEWYFLFAYAILRSVPNKLGGVLALFLSIVIL
GLMPFLHTSKHRSMMLRPLSQALFWTLTMDLLTLTWIGSQPVEYPYTIIGQMASILYFSIILAFLPIAGX
IENY

Files in these formats are surprisingly unpleasant/difficult to read if they are large and cannot be loaded completely into memory. The reason is that we need to use an Iterator to read the file line by line and cannot go back.

Before looking at the solutions below, give it a try yourself! The task is to read the file above (without loading it in memory via open().read()) and return tuples of the form (header, sequence), where header is only the first line of multiple header lines, if there are any, and sequence is the concatenation of all sequence data. In other words, we want a generator or iterator that produces the following tuples:

  
(';LCBO - Prolactin precursor', 'MDSKGSS...NNC*')

('>MCHU - Calmodulin', 'ADQLTEE...TAK*')

(>gi|5524211|gb|AAD44166.1', 'LCLYTHIG...ENY')

The sequence strings above are shortened (...) for readability. Note that newlines within the sequence data need to be removed and that there can be multiple blank lines. Here my solution in plain Python:

def read_seqs():
    header, seqs = None, []
    isheader, prev_isheader = False, False
    for line in open('fasta.txt'):
        line = line.strip()
        if not line: 
            continue
        prev_isheader = isheader
        isheader = line[0] in {'>', ';'}
        if isheader and not prev_isheader:
            if header:
                yield header, ''.join(seqs)
                header, seqs = line, []
        elif not isheader:
            seqs.append(line)
    yield header, ''.join(seqs)

Now we can read and print the sequences one-by-one:

for h,s in read_seqs():
    print '{}\n{}\n'.format(h,s)

However, the code is ugly, complex and it took me embarrassing long to write it. A simple task such as reading a line-structured file should not be that difficult and we will have a look at a much better solution later. But first a quick discussion of the code above.

We read the file 'fasta.txt' with open(), which gives us an iterator over the lines in the file. Newlines are removed and empty lines are skipped. isheader tells us if a line starts with a header prefix and prev_isheader contains this state for the previous line. If we find a header prefix and the previous line was not a header line, we start a new record. If it is not a header line we collect the sequence data. Note that we have to call yield twice. Once when we start a new record we return the record we have completed reading. And the second time when we are finished reading the file. Two flags, two yields, multiple nested if statements, and more than 10 lines to read a simple file is shockingly complex.

Luckily, with itertools the same problem can be solve much more elegantly. In just four lines (not counting imports), to be precise:

from itertools import groupby, izip

lines = (l.strip() for l in open('fasta.txt') if l.strip())
isheader = lambda line: line[0] in {'>', ';'}
chunks= (list(g) for _, g in groupby(lines, isheader))
seqs = ((h[0], ''.join(s)) for h,s in izip(chunks, chunks))

Now this is beautiful! Much shorter, more readable and easy to implement. The key is the groupby function that simplifies our life dramatically here. But let's start from the top.

The first line is a a generator that iterates over the lines in the file, strips newlines and skips empty lines. The second line defines a lambda function isheader that recognizes header lines. In the third line it gets interesting. groupby groups a stream of data (here lines) according to a key function. If you are not familiar with groupby, here a small example to understand it works:

>>> from itertools import groupby
>>> numbers = [1, 2, 3, 4, 5, 1]
>>> gt2 = lambda x : x > 2
>>> [(key, list(group)) for key, group in groupby(numbers, gt2)]
[(False, [1, 2]), (True, [3, 4, 5]), (False, [1])]

We group a list of numbers depending on if they are greater than 2 or not. groupby reads the numbers one-by-one and whenever the key function tt>gt2 changes its value it creates a new group. groupby returns an iterator over the keys and groups it produces, and each group itself is an iterator over its elements. In the example above we use list to convert the group iterator to a list and then use a list comprehension to print the keys and groups.

Note that groupby almost does what we need to read the records in our FASTA file. We can group the lines and start a new group whenever we encounter a header line. Since we don't need the result of the key function we can simply write:

chunks = (list(g) for _, g in groupby(lines, isheader))

Each chunk is a list of lines, containing either all header lines or all sequence lines. For instance, if we print the chunks

for chunk in chunks:
    print chunk

we get the following output (sequences are shortened for readability):

[';LCBO - Prolactin precursor', '; a sample sequence in FASTA format']
['MDS...LSS', 'EMF...YHL', 'VTE...DED', 'ARY...NNC*']
['>MCHU - Calmodulin']
['ADQ...GTID', 'FPE...REA', 'DID...TAK*']
['>gi|5524211|gb|AAD44166.1']
['LCL...NLV', 'EWI...DFLG', 'LLI...IVIL', 'GLM...AGX', 'IENY']

What is left to do is to bring the header lines and their sequence lines together, and there is a very neat trick to achieve this. Remember this one! As you know, zip zips iterables:

>>> zip([1, 2, 3], 'abc')
[(1, 'a'), (2, 'b'), (3, 'c')]

Under Python 2.x zip return a list and under Python 3.x it returns an iterator. izip under Python 2.x does the same as zip under Python 3.x and returns an iterator. More importantly, however, zip and izip operate on iterators and here comes the trick! If we want to group elements of an iterable let's say in tuples of size 2 we can pass the same iterator twice and zip will do the job:

>>> numbers = iter([1, 2, 3, 4, 5, 6])
>>> zip(numbers, numbers)
[(1, 2), (3, 4), (5, 6)]

Note that we wrap numbers into an iterator by calling iter(). Without that, the result would be different:

>>> numbers = [1, 2, 3, 4, 5, 6]
>>> zip(numbers, numbers)
[(1, 1), (2, 2), (3, 3), (4, 4), (5, 5), (6, 6)]

What's going on here? zip reads the an element from numbers and reads again to build the first tuple. Since numbers is an iterator the element pointer progresses and we get (1, 2). We could equally easily group numbers in triples:

>>> numbers = iter([1, 2, 3, 4, 5, 6])
>>> zip(numbers, numbers, numbers)
[(1, 2, 3), (4, 5, 6)]

Obviously, this works also for our FASTA file reading. We can group the chunks of header or sequence lines in tuples of size 2 to bring header and sequence together. We use izip because we don't want to collect all sequences in memory. So the fourth line of

seqs = ((h[0], ''.join(s)) for h,s in izip(chunks, chunks))

the itertools solution above means that we group header lists and sequence lists in tuples h, s. But we need only the first header line and the concatenation of the sequence lines and therefore return (h[0], ''.join(s)) in the generator expression.

There you go! A much simpler implementation to read FASTA files and other line-structured files. Can we do even better? Using a library for data flows such as nuts-flow leads to even shorter and more readable code:

from nutsflow import *
from nutsflow import _

isheader = lambda line: line[0] in {'>', ';'}
seqs = (open('fasta.txt') >> Map(str.strip) >> Filter(_) >> 
        ChunkBy(isheader) >> Chunk(2) >> MapCol(0,next) >> MapCol(1,''.join))

The >> operator in nuts-flow indicates the chaining of iterators and defines the data flow. Here we open the file, map the string stripping function on each line to remove white spaces and then filter out empty lines. The underscore in Filter(_) functions as an anonymous variable here. Similar to Python's groupby, ChunkBy groups the lines according to the isheader predicate. We then call Chunk(2) to bring header and sequence lines together in tuples of the form (header, sequence) -- no need for the zip-trick we used before. Finally, we map next() on column 0 and join on column 1 of these tuples to extract the first header line and to concatenate the sequence lines.

That's it. Can't get better than that ;)