
Lambda function with itertools count() and groupby()

Can someone please explain the groupby operation and the lambda function being used on this SO post?

key=lambda k, line=count(): next(line) // chunk

import os
import tempfile
from itertools import groupby, count

temp_dir = tempfile.mkdtemp()

def tempfile_split(filename, temp_dir, chunk=4000000):
    with open(filename, 'r') as datafile:
        # The itertools.groupby() function takes a sequence and a key function,
        # and returns an iterator that generates pairs.
        #
        # Each pair contains the result of key_function(each item) and
        # another iterator containing all the items that shared that key result.
        groups = groupby(datafile, key=lambda k, line=count(): next(line) // chunk)
        for k, group in groups:
            # A group is a one-shot iterator: materialize it once so it can be
            # both printed and written (list(group) would otherwise exhaust it).
            lines = list(group)
            print(k, lines)

            output_name = os.path.normpath(os.path.join(temp_dir + os.sep, "tempfile_%s.tmp" % k))
            # Open the output file once per group rather than once per line.
            with open(output_name, 'a') as outfile:
                outfile.writelines(lines)

Edit: It took me a while to wrap my head around the lambda function used with groupby. I don't think I understood either of them very well.

Martijn explained it really well; however, I have a follow-up question. Why is line=count() passed as an argument to the lambda function every time? I tried assigning the variable line to count() just once, outside the function.

    line = count()
    groups = groupby(datafile, key=lambda k, line: next(line) // chunk)

and it resulted in TypeError: <lambda>() missing 1 required positional argument: 'line'

Also, calling next on count() directly within the lambda expression resulted in all the lines in the input file getting bunched together, i.e. a single key was generated by the groupby function.

groups = groupby(datafile, key=lambda k: next(count()) // chunk)

I'm learning Python on my own, so any help or pointers to reference materials or PyCon talks are much appreciated. Anything really!

itertools.count() is an infinite iterator of increasing integer numbers.

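The lambda stores a count() instance as the default value of its line keyword argument. Default values are evaluated only once, when the lambda is defined, so every time the lambda is called the local variable line references that same object. next() advances an iterator, retrieving the next value: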

>>> from itertools import count
>>> line = count()
>>> next(line)
0
>>> next(line)
1
>>> next(line)
2
>>> next(line)
3

So next(line) retrieves the next count in the sequence, and divides that value by chunk (taking only the integer portion of the division). The k argument is ignored.

Because integer division is used, the result of the lambda is going to be chunk repeats of an increasing integer; if chunk is 3, then you get 0 three times, then 1 three times, then 2 three times, etc:

>>> chunk = 3
>>> l = lambda k, line=count(): next(line) // chunk
>>> [l('ignored') for _ in range(10)]
[0, 0, 0, 1, 1, 1, 2, 2, 2, 3]
>>> chunk = 4
>>> l = lambda k, line=count(): next(line) // chunk
>>> [l('ignored') for _ in range(10)]
[0, 0, 0, 0, 1, 1, 1, 1, 2, 2]

It is this resulting value that groupby() groups the datafile iterable by, producing groups of chunk lines.

When looping over the groupby() results with for k, group in groups:, k is the number that the lambda produced and that the results are grouped by; the code uses it only to name the output file. group is an iterable of lines from datafile, and contains chunk lines (the last group may contain fewer if the file does not divide evenly).
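The grouping can be tried on a small in-memory list instead of a file. This is a minimal sketch, not the original poster's code: the helper name chunked and the sample lines are made up for illustration.

```python
from itertools import count, groupby

def chunked(lines, chunk):
    """Yield (key, list_of_lines) pairs with up to `chunk` lines each (hypothetical helper)."""
    counter = count()  # a fresh counter for every call to chunked()
    # groupby calls the key function once per item; the item itself is ignored.
    for key, group in groupby(lines, key=lambda _item, c=counter: next(c) // chunk):
        # Materialize the group now: it becomes invalid once groupby advances.
        yield key, list(group)

sample = ["a\n", "b\n", "c\n", "d\n", "e\n", "f\n", "g\n"]  # made-up data
result = list(chunked(sample, 3))
print(result)
```

With seven lines and chunk set to 3, the last group holds only the single leftover line.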

In response to the updated OP...

itertools.groupby() groups consecutive items that share the same key; defining a key function gives you control over how that key is computed. See more on how itertools.groupby() works.

A lambda is a shorthand way of writing a small function as a single expression. For example:

>>> keyfunc = lambda k, line=count(): next(line) // chunk

is equivalent to this regular function:

>>> def keyfunc(k, line=count()):
...     return next(line) // chunk
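To see that the default is created once and then reused, the stored default can be inspected on the function object. A small sketch, assuming chunk = 3 purely for illustration:

```python
from itertools import count

chunk = 3  # illustrative value, not from the original post

def keyfunc(k, line=count()):
    # `line` defaults to the single count() object created when `def` ran.
    return next(line) // chunk

# Default values are stored on the function object itself.
stored = keyfunc.__defaults__[0]

first_three = [keyfunc("ignored") for _ in range(3)]  # consumes 0, 1, 2
next_from_stored = next(stored)                       # same counter, so it yields 3
after = keyfunc("ignored")                            # counter yields 4, and 4 // 3 == 1
print(first_three, next_from_stored, after)
```

Advancing the stored object by hand shifts what keyfunc returns next, which shows that both names refer to one shared counter.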

Keywords: iterator, functional programming, anonymous functions


Details

Why is line=count() passed as an argument to the lambda function every time?

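The reason is the same as for normal functions, and count() is not actually passed on every call. Written as lambda k, line: ..., line is a required positional parameter, but groupby() calls the key function with a single argument (the item), hence the TypeError. Giving line a value in the parameter list makes it a keyword argument with a default, and that default is evaluated once, when the function is defined. See more on positional vs. keyword arguments.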
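The error from the question can be reproduced on a short list. A minimal sketch, with a made-up chunk value and input; the exact wording of the message may vary between Python versions, so it is captured rather than assumed:

```python
from itertools import count, groupby

chunk = 2       # illustrative value
line = count()  # defined outside, but never handed to the lambda

# Two required parameters: groupby supplies only the item, so `line` is missing.
bad_key = lambda k, line: next(line) // chunk

try:
    list(groupby(["a", "b", "c"], key=bad_key))
    error_message = None
except TypeError as exc:
    error_message = str(exc)

print(error_message)
```

The outer variable line is never used here: the parameter named line shadows it, and nothing supplies a value for that parameter.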

You can still create the counter outside the function, as long as you bind it as the default value of a keyword argument:

>>> chunk = 3
>>> line=count()
>>> keyfunc = lambda k, line=line: next(line) // chunk       # make `line` a keyword arg
>>> [keyfunc("") for _ in range(10)]
[0, 0, 0, 1, 1, 1, 2, 2, 2, 3]
>>> [keyfunc("") for _ in range(10)]
[3, 3, 4, 4, 4, 5, 5, 5, 6, 6]                               # note `count()` continues
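Another option, not used in the original post, is to drop the second parameter entirely and let the lambda look up the outer variable. A sketch with a made-up input string:

```python
from itertools import count, groupby

chunk = 3
line = count()

# No second parameter: `line` is looked up in the enclosing scope on each call.
keyfunc = lambda k: next(line) // chunk

keys = [key for key, _group in groupby("abcdefg", key=keyfunc)]
print(keys)
```

The difference from the default-argument form is when the name is resolved: a default is bound once at definition time, while a closure looks the name up at call time, so rebinding line later would change which counter the lambda uses.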

... calling next on count() directly within the lambda expression, resulted in all the lines in the input file getting bunched together ie a single key was generated by the groupby function ...

Try the following experiment with count() :

>>> numbers = count()
>>> next(numbers)
0
>>> next(numbers)
1
>>> next(numbers)
2

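As expected, next() yields the next item from the count() iterator each time. (Looping over an iterator with a for loop does the same thing implicitly.) The important point is that an iterator does not reset: next() simply gives the next item in line, as seen in the former example. Calling next(count()) inside the lambda, however, creates a brand-new counter on every call, so the result is always 0 and every line gets the key 0 // chunk, which is 0.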

@Martijn Pieters pointed out that next(line) // chunk computes a floored integer that groupby uses as the key for each line (bunching consecutive lines with equal keys together), which is also expected. See the references for more on how groupby works.
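The single-key result from the question can be reproduced directly. A minimal sketch, with an illustrative chunk value and a made-up input string:

```python
from itertools import count, groupby

chunk = 3

# A new counter is built on every call, so next() always returns its first value, 0.
fresh_each_time = lambda k: next(count()) // chunk

keys = [fresh_each_time("ignored") for _ in range(5)]

# Every item gets key 0, so groupby produces one group holding everything.
groups = [(key, list(group)) for key, group in groupby("abcdefg", key=fresh_each_time)]
print(keys, groups)
```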

References
