Tuesday, May 2, 2017

Fun with itertools in Python

Let's say we are given a file 'numbers1.txt' that contains a header, numbers, and blank lines, and the task is to sum up all numbers in the file. The example file used below has one header line followed by the lines 1, 2, a blank line, 3 and 4.



If the file can be loaded into memory this is easy

>>> lines = list(open('numbers1.txt'))
>>> sum(int(l) for l in lines[1:] if l.strip())

open() returns an iterator over the lines in the file, which we collect in a list. Alternatively, we could have used open().readlines(). We slice with lines[1:] to skip the header line and test for non-empty lines with if l.strip(). A sum over a generator expression finally returns the wanted result. Note that for this simple example we don't bother with closing the file, which you should do for robust code.
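For robust code, a with statement closes the file automatically. The following is a minimal sketch in Python 3 syntax, not part of the original example: the function name filesum_in_memory and the header text 'values' are made up for illustration, and the sample file is written to a temporary directory so the snippet runs on its own.

```python
import os
import tempfile

# Hypothetical sample file: one header line, numbers, and a blank line.
# The header text 'values' is an assumption; the original header is not shown.
path = os.path.join(tempfile.mkdtemp(), 'numbers1.txt')
with open(path, 'w') as f:
    f.write('values\n1\n2\n\n3\n4')

def filesum_in_memory(fname):
    # The with block closes the file even if an exception is raised.
    with open(fname) as f:
        lines = f.readlines()
    # Skip the header line and ignore blank lines.
    return sum(int(l) for l in lines[1:] if l.strip())

total = filesum_in_memory(path)
print(total)  # 10
```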

If the file is too big to be loaded into memory, things get a bit ugly. open() returns an iterator, which is good, but we need to skip the header line and cannot apply list slicing to an iterator. However, we can advance the iterator by calling next() to solve this issue

>>> lines = open('numbers1.txt')
>>> next(lines)
>>> sum(int(l) for l in lines if l.strip())

But what if there are multiple header lines and multiple files we want to sum over? Let's implement this in plain Python and then use itertools to solve this more elegantly.

from glob import glob

def filesum(fname, skip=1):   # sum numbers in file
  lines = open(fname)
  for _ in xrange(skip):      # skip header lines
    next(lines)
  return sum(int(l) for l in lines if l.strip())

sum(filesum(fname) for fname in glob('numbers*.txt'))

As you can see, things are getting a tad ugly, especially with respect to the skipping of multiple header lines. We are going to use islice, ifilter, and imap from the itertools library to clean things up. Note that the names of several functions in itertools start with an 'i' to indicate that they are the iterator counterparts of familiar operations (slicing, filter, map) and do not collect results in memory.

Similar to list slicing, islice() takes a slice out of an iterator. Here we use it to skip one header line

>>> from itertools import islice
>>> list(islice(open('numbers1.txt'), 1, None))
['1\n', '2\n', '\n', '3\n', '4']

where islice(iterable, start, stop[, step]) takes an iterable as input and, when called with None for the stop parameter, skips the first start elements and yields all remaining ones. Since islice returns an iterator, we wrap it into a list to display the result.
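To make the islice() parameters more concrete, here is a small sketch on an in-memory list (the list and variable names are only for illustration). It also shows that islice consumes the underlying iterator as it goes, which is exactly what makes it suitable for skipping header lines.

```python
from itertools import islice

letters = ['a', 'b', 'c', 'd', 'e']

# start=2, stop=None: skip the first two elements, keep the rest.
skip_two = list(islice(letters, 2, None))        # ['c', 'd', 'e']

# A single argument is interpreted as stop: take the first two elements.
first_two = list(islice(letters, 2))             # ['a', 'b']

# step=2: every second element.
every_other = list(islice(letters, 0, None, 2))  # ['a', 'c', 'e']

# islice advances the underlying iterator; the remaining
# elements are still available afterwards.
it = iter(letters)
taken = list(islice(it, 1, 3))                   # ['b', 'c']
rest = list(it)                                  # ['d', 'e']
```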

The next step is to filter out the blank lines and we employ ifilter() for this purpose

>>> from itertools import islice, ifilter

>>> lines = islice(open('numbers1.txt'), 1, None)
>>> list(ifilter(lambda l: l.strip(), lines))
['1\n', '2\n', '3\n', '4']

In the next step we convert the number strings into integers by mapping the int function over the lines of the file using imap()

>>> from itertools import islice, ifilter, imap

>>> alllines = islice(open('numbers1.txt'), 1, None)
>>> numlines = ifilter(lambda l: l.strip(), alllines)
>>> list(imap(int, numlines))
[1, 2, 3, 4]

Note that imap, in contrast to map, returns an iterator in Python 2, while in Python 3 the built-in map behaves like imap (and imap is no longer part of itertools). Having an iterator over integers, we can now sum them up

>>> from itertools import islice, ifilter, imap
>>> SKIP = 1
>>> alllines = islice(open('numbers1.txt'), SKIP, None)
>>> numlines = ifilter(lambda l: l.strip(), alllines)
>>> sum(imap(int, numlines))
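For readers on Python 3, where ifilter and imap no longer exist in itertools, the same pipeline can be sketched with the built-in filter and map, which are lazy there. This is an illustrative translation, not the original code: an io.StringIO object stands in for the file, and the header text 'values' is an assumption.

```python
import io
from itertools import islice

SKIP = 1

# In-memory stand-in for open('numbers1.txt'); the header text is made up.
f = io.StringIO('values\n1\n2\n\n3\n4')

alllines = islice(f, SKIP, None)                   # skip the header line
numlines = filter(lambda l: l.strip(), alllines)   # drop blank lines, lazily
numbers = map(int, numlines)                       # convert to int, lazily

# Nothing has been read from f so far; sum() drives the whole pipeline.
total = sum(numbers)
print(total)  # 10
```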

What is left to do is to aggregate over multiple files and similarly to our first implementation we define a separate function filesum for this purpose

from glob import glob
from itertools import islice, ifilter, imap

def filesum(fname, skip=1):
    isnumber = lambda l: l.strip()
    lines = islice(open(fname), skip, None)
    return sum(imap(int, ifilter(isnumber, lines)))

sum(imap(filesum, glob('numbers*.txt')))
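As an aside, the function above could be written for Python 3 roughly as follows. This is a sketch under assumptions rather than the original implementation: it uses the built-in map and filter in place of imap and ifilter, closes each file with a with statement, and creates two hypothetical sample files (with a made-up header 'values') in a temporary directory so that it runs on its own.

```python
import os
import tempfile
from glob import glob
from itertools import islice

def filesum(fname, skip=1):
    # sum() consumes the lazy pipeline while the file is still open.
    with open(fname) as f:
        lines = islice(f, skip, None)
        # str.strip returns '' for blank lines, which filter treats as false.
        return sum(map(int, filter(str.strip, lines)))

# Hypothetical sample files; names follow the 'numbers*.txt' pattern.
tmpdir = tempfile.mkdtemp()
samples = [('numbers1.txt', 'values\n1\n2\n\n3\n4'),
           ('numbers2.txt', 'values\n10\n\n20\n')]
for name, body in samples:
    with open(os.path.join(tmpdir, name), 'w') as f:
        f.write(body)

total = sum(map(filesum, glob(os.path.join(tmpdir, 'numbers*.txt'))))
print(total)  # 40
```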

If you compare both implementations, I hope you will find the itertools-based code a bit more readable and elegant than the plain Python code presented at the start. But still the code seems overly complex for such a simple problem. An alternative to itertools is nuts-flow, which allows implementing data flows, such as the one above, much more easily. For instance, to sum up the numbers within the file we could simply write

>>> from nutsflow import Drop, Map, Sum, nut_filter
>>> IsNumber = nut_filter(lambda l: l.strip())

>>> open('numbers1.txt') >> Drop(1) >> IsNumber() >> Map(int) >> Sum()

where Drop(1) skips the header line, IsNumber keeps only the non-empty lines (the numbers), Map(int) converts them to integers, and Sum sums the numbers up. The explicit data flow and linear sequence of processing steps make this code considerably more readable, and it can easily be extended to operate over multiple files

>>> from glob import glob
>>> from nutsflow import Drop, Map, Sum, nut_filter

>>> IsNumber = nut_filter(lambda l: l.strip())
>>> filesum = lambda f: open(f) >> Drop(1) >> IsNumber() >> Map(int) >> Sum()
>>> glob('numbers*.txt') >> Map(filesum) >> Sum()

More about nuts-flow in a follow-up post. That's it for today ;)