Can someone please explain the groupby operation and the lambda function being used on this SO post?
key=lambda k, line=count(): next(line) // chunk
import os
import tempfile
from itertools import groupby, count

temp_dir = tempfile.mkdtemp()

def tempfile_split(filename, temp_dir, chunk=4000000):
    with open(filename, 'r') as datafile:
        # The itertools.groupby() function takes a sequence and a key function,
        # and returns an iterator that generates pairs.
        # Each pair contains the result of key_function(each item) and
        # another iterator containing all the items that shared that key result.
        groups = groupby(datafile, key=lambda k, line=count(): next(line) // chunk)
        for k, group in groups:
            # print(k, list(group)) would consume the group, leaving nothing to write
            output_name = os.path.normpath(os.path.join(temp_dir + os.sep, "tempfile_%s.tmp" % k))
            for line in group:
                with open(output_name, 'a') as outfile:
                    outfile.write(line)
Edit: It took me a while to wrap my head around the lambda function used with groupby. I don't think I understood either of them very well.
Martijn explained it really well; however, I have a follow-up question. Why is line=count() passed as an argument to the lambda function every time? I tried assigning the variable line to count() just once, outside the function.
line = count()
groups = groupby(datafile, key=lambda k, line: next(line) // chunk)
and it resulted in TypeError: <lambda>() missing 1 required positional argument: 'line'
Also, calling next on count() directly within the lambda expression resulted in all the lines in the input file getting bunched together, i.e. a single key was generated by the groupby function.
groups = groupby(datafile, key=lambda k: next(count()) // chunk)
I'm learning Python on my own, so any help or pointers to reference materials or PyCon talks are much appreciated. Anything really!
itertools.count() is an infinite iterator of increasing integer numbers.
The lambda stores an instance as a keyword argument, so every time the lambda is called the local variable line references that object. next() advances an iterator, retrieving the next value:
>>> from itertools import count
>>> line = count()
>>> next(line)
0
>>> next(line)
1
>>> next(line)
2
>>> next(line)
3
So next(line) retrieves the next count in the sequence, and the lambda divides that value by chunk (taking only the integer portion of the division). The k argument is ignored.
Because integer division is used, the result of the lambda is going to be chunk repeats of an increasing integer; if chunk is 3, then you get 0 three times, then 1 three times, then 2 three times, etc:
>>> chunk = 3
>>> l = lambda k, line=count(): next(line) // chunk
>>> [l('ignored') for _ in range(10)]
[0, 0, 0, 1, 1, 1, 2, 2, 2, 3]
>>> chunk = 4
>>> l = lambda k, line=count(): next(line) // chunk
>>> [l('ignored') for _ in range(10)]
[0, 0, 0, 0, 1, 1, 1, 1, 2, 2]
It is this resulting value that groupby() groups the datafile iterable by, producing groups of chunk lines.
When looping over the groupby() results with for k, group in groups:, k is the number that the lambda produced and that the results are grouped by; the for loop in the code uses it only to name the output file. group is an iterable of lines from datafile, and contains chunk lines, except for the last group, which may contain fewer.
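A small sketch of the grouping described above, using an in-memory list in place of the file. The names lines and result and the sample data are illustrative assumptions, not part of the original code. It also shows the final group coming out shorter than chunk when the input does not divide evenly.

```python
from itertools import count, groupby

# Illustrative stand-in for the lines of a file (not from the original post).
lines = ["a\n", "b\n", "c\n", "d\n", "e\n", "f\n", "g\n"]
chunk = 3

# The key ignores the item itself and returns 0, 0, 0, 1, 1, 1, 2, ...
groups = groupby(lines, key=lambda k, line=count(): next(line) // chunk)

# Materialise each group right away: groupby() invalidates a group once
# the outer iterator advances to the next one.
result = [(k, list(group)) for k, group in groups]

for k, group_lines in result:
    print(k, group_lines)
```

With seven lines and chunk set to 3, the keys are 0, 1 and 2, and the group for key 2 holds the single leftover line.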
In response to the updated OP...
The itertools.groupby iterator offers ways to group items together, giving more control when a key function is defined. See more on how itertools.groupby() works.
The lambda function is a shorthand way of writing a regular function. For example:
>>> keyfunc = lambda k, line=count(): next(line) // chunk
is equivalent to this regular function:
>>> def keyfunc(k, line=count()):
...     return next(line) // chunk
Keywords : iterator, functional programming, anonymous functions
Details
Why is line=count() passed as an argument to the lambda function every time?
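The reason is the same as for normal functions. By itself, the line parameter is a required positional argument, so a key function that groupby() calls with only one argument fails with the TypeError shown in the question. When a value is assigned, it becomes a keyword argument with a default, and that default is evaluated once, when the lambda is defined, so every call shares the same count() object. See more on positional vs. keyword arguments.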
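The difference can be checked without groupby() at all. The sketch below uses made-up names (with_default, no_default, shared, error) and calls each lambda with a single argument, which is how groupby() calls its key function.

```python
from itertools import count

# The default value count() is evaluated once, when the lambda is defined,
# so every call shares the same counter object.
with_default = lambda k, line=count(): next(line)
shared = [with_default("x") for _ in range(3)]

# Without a default, `line` is a required second argument.
no_default = lambda k, line: next(line)
try:
    no_default("x")  # one argument only, as groupby() would call it
    error = None
except TypeError as exc:
    error = str(exc)

print(shared)
print(error)
```

The first lambda keeps counting across calls, while the second raises the same kind of TypeError reported in the question.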
You can still create the count() object outside the function and pass it in as the default value of a keyword argument:
>>> chunk = 3
>>> line=count()
>>> keyfunc = lambda k, line=line: next(line) // chunk # make `line` a keyword arg
>>> [keyfunc("") for _ in range(10)]
[0, 0, 0, 1, 1, 1, 2, 2, 2, 3]
>>> [keyfunc("") for _ in range(10)]
[3, 3, 4, 4, 4, 5, 5, 5, 6, 6] # note `count()` continues
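Another option, sketched here as an alternative and not something the original code does, is to leave line out of the parameter list entirely and let the lambda look it up in the enclosing scope. The sample data and the name grouped are assumptions for illustration.

```python
from itertools import count, groupby

chunk = 2
line = count()  # created once, outside the lambda

# `line` is not a parameter here; the lambda finds it in the enclosing scope.
keyfunc = lambda k: next(line) // chunk

data = ["a", "b", "c", "d", "e"]
grouped = [(k, list(g)) for k, g in groupby(data, key=keyfunc)]
print(grouped)
```

This works because the key function still takes exactly one argument. The trade-off is that the counter lives in a shared outer variable, so anything else that advances or rebinds line changes the keys.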
... calling next on count() directly within the lambda expression, resulted in all the lines in the input file getting bunched together ie a single key was generated by the groupby function ...
Try the following experiment with count():
>>> numbers = count()
>>> next(numbers)
0
>>> next(numbers)
1
>>> next(numbers)
2
As expected, next() yields the next item from the count() iterator. (Looping over an iterator with a for loop does the same thing.) The important point is that an iterator does not reset: next() simply gives the next item in line, as seen in the example above. Writing next(count()) inside the lambda, however, creates a brand-new counter on every call, so the result is always 0 and every line gets the same key.
@Martijn Pieters pointed out that next(line) // chunk computes a floored integer that groupby uses as the key for each line (bunching consecutive lines with the same key together), which is also expected. See the references for more on how groupby works.
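To see the single-key behaviour from the question directly, here is a minimal sketch with illustrative data. The names fresh_each_time, keys and single are assumptions, not from the original post.

```python
from itertools import count, groupby

chunk = 3
data = ["a", "b", "c", "d", "e", "f", "g"]

# count() is called again on every invocation, so each call gets a brand-new
# counter and next() returns its first value, 0.
fresh_each_time = lambda k: next(count()) // chunk
keys = [fresh_each_time(item) for item in data]

# Every item gets key 0, so groupby() produces exactly one group.
single = [(k, list(g)) for k, g in groupby(data, key=fresh_each_time)]
print(keys)
print(single)
```

Because the key never changes, groupby() never starts a second group, which matches the "all lines bunched together" result described in the question.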
References
itertools.count
itertools.groupby()