Lambda function with itertools count() and groupby()

Question

Can someone please explain the groupby operation and the lambda function being used on this SO post?

key=lambda k, line=count(): next(line) // chunk

import os
import tempfile
from itertools import groupby, count

temp_dir = tempfile.mkdtemp()

def tempfile_split(filename, temp_dir, chunk=4000000):
    with open(filename, 'r') as datafile:
        # The itertools.groupby() function takes a sequence and a key function,
        # and returns an iterator that generates pairs.
        #
        # Each pair contains the result of key_function(each item) and
        # another iterator containing all the items that shared that key result.
        groups = groupby(datafile, key=lambda k, line=count(): next(line) // chunk)
        for k, group in groups:
            # Debugging only: list(group) would exhaust the group before it is written.
            # print(k, list(group))

            output_name = os.path.normpath(os.path.join(temp_dir + os.sep, "tempfile_%s.tmp" % k))
            # Open each output file once per group rather than once per line.
            with open(output_name, 'a') as outfile:
                for line in group:
                    outfile.write(line)

Edit: It took me a while to wrap my head around the lambda function used with groupby. I don't think I understood either of them very well.

Martijn explained it really well; however, I have a follow-up question. Why is line=count() passed as an argument to the lambda function every time? I tried assigning the variable line to count() just once, outside the function.

    line = count()
    groups = groupby(datafile, key=lambda k, line: next(line) // chunk)

and it resulted in TypeError: <lambda>() missing 1 required positional argument: 'line'

Also, calling next on count() directly within the lambda expression resulted in all the lines in the input file getting bunched together, i.e. a single key was generated by the groupby function.

groups = groupby(datafile, key=lambda k: next(count()) // chunk)

I'm learning Python on my own, so any help or pointers to reference materials or PyCon talks are much appreciated. Anything really!

Answer 1

itertools.count() is an infinite iterator of increasing integer numbers.

The lambda stores a count() instance as the default value of its line parameter. That default is evaluated once, when the lambda is created, so every time the lambda is called the local variable line references that same object. next() advances an iterator, retrieving the next value:

>>> from itertools import count
>>> line = count()
>>> next(line)
0
>>> next(line)
1
>>> next(line)
2
>>> next(line)
3

So next(line) retrieves the next count in the sequence, and the lambda floor-divides that value by chunk (keeping only the integer portion of the division). The k argument is ignored.

Because integer division is used, the result of the lambda is going to be chunk repeats of an increasing integer; if chunk is 3, then you get 0 three times, then 1 three times, then 2 three times, etc:

>>> chunk = 3
>>> l = lambda k, line=count(): next(line) // chunk
>>> [l('ignored') for _ in range(10)]
[0, 0, 0, 1, 1, 1, 2, 2, 2, 3]
>>> chunk = 4
>>> l = lambda k, line=count(): next(line) // chunk
>>> [l('ignored') for _ in range(10)]
[0, 0, 0, 0, 1, 1, 1, 1, 2, 2]

It is this resulting value that groupby() groups the datafile iterable by, producing groups of chunk lines.

When looping over the groupby() results with for k, group in groups:, k is the number that the lambda produced and that the results are grouped by; the code uses it only to name the output file. group is an iterable of lines from datafile and contains chunk lines, except for the last group, which may contain fewer.
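To see the grouping without touching the file system, here is a minimal sketch that uses a short list of strings as a stand-in for the open file (the names lines and chunks are only for illustration). Note that the last group is shorter when the number of lines is not a multiple of chunk:

```python
from itertools import count, groupby

lines = ["a\n", "b\n", "c\n", "d\n", "e\n", "f\n", "g\n"]  # stand-in for datafile
chunk = 3

# The default count() is created once, when the lambda is defined,
# so every call advances the same counter.
key = lambda _item, line=count(): next(line) // chunk

# Materialise each group immediately; groupby() groups share the underlying iterator.
chunks = [(k, list(group)) for k, group in groupby(lines, key=key)]

for k, group in chunks:
    print(k, group)
# 0 ['a\n', 'b\n', 'c\n']
# 1 ['d\n', 'e\n', 'f\n']
# 2 ['g\n']
```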

Answer 2

In response to the updated OP...

The itertools.groupby iterator groups consecutive items that share a key; defining a key function gives more control over how the items are grouped. See the itertools.groupby() docs in the references for more on how it works.

A lambda is a shorthand way of writing a small, anonymous function as a single expression. For example:

>>> keyfunc = lambda k, line=count(): next(line) // chunk

Is equivalent to this regular function:

>>> def keyfunc(k, line=count()):
...     return next(line) // chunk

Keywords : iterator, functional programming, anonymous functions

Details

Why is line=count() passed as an argument to the lambda function every time?

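The reason is the same as for normal functions. A parameter such as line with no default value is required: the caller must supply it. When a default value is given (line=count()), the parameter becomes optional, and the default is evaluated once, when the function is defined, not on every call. groupby() calls the key function with a single argument (the item), so a lambda with two required parameters raises the TypeError shown in the question.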
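The following sketch illustrates that the default counter lives on the function object and is shared between calls. The variable names (stored, first, override, after) are only for illustration:

```python
from itertools import count

chunk = 3
keyfunc = lambda k, line=count(): next(line) // chunk

# The default value is stored on the function object at definition time.
stored = keyfunc.__defaults__[0]

# Each call with one argument advances that same stored counter: 0, 1, 2, 3.
first = [keyfunc("x") for _ in range(4)]

# Passing a second argument overrides the default for that call only.
override = keyfunc("x", count(30))   # 30 // 3

# The stored counter was not touched by the override; it continues at 4.
after = keyfunc("x")                 # 4 // 3

print(first, override, after)
```

Nothing is "passed every time": the lambda is called with one argument, and the single count() object created at definition time fills in the second parameter.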

You can still create the count() object outside the function by using it as the default value of the parameter:

>>> chunk = 3
>>> line=count()
>>> keyfunc = lambda k, line=line: next(line) // chunk       # use `line` as the default value
>>> [keyfunc("") for _ in range(10)]
[0, 0, 0, 1, 1, 1, 2, 2, 2, 3]
>>> [keyfunc("") for _ in range(10)]
[3, 3, 4, 4, 4, 5, 5, 5, 6, 6]                               # note `count()` continues
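Another option, sketched here as an alternative rather than as the approach used in the original post, is a one-parameter lambda that refers to an outer line variable as a closure. The second half reproduces the TypeError from the question, which arises because groupby() calls the key function with exactly one argument:

```python
from itertools import count, groupby

chunk = 2
line = count()

# One-parameter lambda: `line` is looked up in the enclosing scope at call time.
key = lambda _item: next(line) // chunk

result = [(k, list(g)) for k, g in groupby("abcde", key=key)]
print(result)  # [(0, ['a', 'b']), (1, ['c', 'd']), (2, ['e'])]

# Two required parameters: groupby() supplies only the item, so the call fails.
try:
    list(groupby("ab", key=lambda k, line: 0))
except TypeError as exc:
    error = str(exc)
    print(error)
```

The closure version depends on the outer variable still referring to the same counter when the key function is called, whereas the default-argument version binds the counter to the function itself.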

... calling next on count() directly within the lambda expression resulted in all the lines in the input file getting bunched together, i.e. a single key was generated by the groupby function ...

Try the following experiment with count() :

>>> numbers = count()
>>> next(numbers)
0
>>> next(numbers)
1
>>> next(numbers)
2

As expected, next() yields the next item from the count() iterator. (Looping over an iterator with a for loop calls next() in the same way.) The key point is that an iterator does not reset: each next() call on the same object continues where the previous call stopped. In lambda k: next(count()) // chunk, however, count() is called inside the lambda body, so every call creates a brand-new counter whose first value is 0. The key is therefore always 0 // chunk, which is 0, and groupby() puts every line into a single group.

@Martijn Pieters pointed out that next(line) // chunk computes a floored integer that groupby uses as the key for each line (bunching consecutive lines with the same key together), which is also expected. See the references for more on how groupby works.
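A small sketch of the difference, using a hypothetical list of seven items in place of the file. The names fresh_each_call and shared are made up for this comparison:

```python
from itertools import count, groupby

chunk = 3
items = list("abcdefg")

# count() is called inside the body: a new counter on every call,
# so next() always returns 0 and the key is always 0.
fresh_each_call = lambda _item: next(count()) // chunk
keys_fresh = [fresh_each_call(x) for x in items]

# count() is called once, as the default value: one shared counter.
shared = lambda _item, line=count(): next(line) // chunk
keys_shared = [shared(x) for x in items]

print(keys_fresh)   # [0, 0, 0, 0, 0, 0, 0]
print(keys_shared)  # [0, 0, 0, 1, 1, 1, 2]

# With a constant key, groupby() produces a single group holding everything.
groups_fresh = [(k, list(g)) for k, g in groupby(items, key=fresh_each_call)]
print(groups_fresh)
```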

References

Docs for itertools.count()
Docs for itertools.groupby()
Beazley, D. and Jones, B. "7.7 Capturing Variables in Anonymous Functions," Python Cookbook, 3rd ed. O'Reilly. 2013.

solution1
3 ACCEPTED 2017-12-27 16:30:03

solution2
1 2017-12-28 21:52:57