Baffled by itertools groupby summation
Consider this...
from itertools import groupby
from operator import itemgetter
data = [{'pid': 1, 'items': 1}, {'pid': 2, 'items': 5}, {'pid': 1, 'items': 3}]
data = sorted(data, key=itemgetter('pid'))
for pid, rows in groupby(data, lambda x: x['pid']):
    print(pid, sum(r['items'] for r in rows))
    for key in ['items']:
        print(pid, sum(r[key] for r in rows))
The first print() call prints the right numbers: 4 for pid 1 and 5 for pid 2. The second print() call, inside the loop over the key list, prints 0 for both. What's going on?
The rows object you get from groupby is a one-shot iterator: it can only be consumed once. As you iterate through it for your first print() call, you consume the values, so rows is already exhausted when you try to iterate over it the next time, and the sum over an empty iterator is 0.
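The exhaustion can be reproduced in isolation. The following is a minimal sketch, assuming a cut-down version of the question's data; the names first_pass, second_pass and totals are illustrative and not part of the original code.

```python
from itertools import groupby
from operator import itemgetter

# Cut-down sample data, already sorted by 'pid' (assumption for brevity).
data = [{'pid': 1, 'items': 1}, {'pid': 1, 'items': 3}]

totals = []
for pid, rows in groupby(data, key=itemgetter('pid')):
    first_pass = sum(r['items'] for r in rows)   # consumes every row in the group
    second_pass = sum(r['items'] for r in rows)  # nothing left to iterate over
    totals.append((pid, first_pass, second_pass))

print(totals)
```

The second sum sees no rows at all, which is why it comes out as 0 rather than raising an error.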
You could use row_list = list(rows) and then work with row_list if you want the items to persist across multiple iteration passes.
For greater clarity, I suggest putting your code into the Python REPL, inspecting type(rows) in that loop, and looking at what API that object provides.
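A rough sketch of that kind of inspection, written as a script rather than a REPL session. The variable names grouped, is_own_iterator, has_next and has_len are made up for illustration.

```python
from itertools import groupby
from operator import itemgetter

data = [{'pid': 1, 'items': 1}, {'pid': 1, 'items': 3}, {'pid': 2, 'items': 5}]

grouped = groupby(data, key=itemgetter('pid'))
pid, rows = next(grouped)  # take just the first group

print(type(rows))  # an itertools-internal iterator type, not a list

# Iterators return themselves from iter(); containers such as lists do not.
is_own_iterator = iter(rows) is rows
# The group supports next() but has no length and no indexing.
has_next = hasattr(rows, '__next__')
has_len = hasattr(rows, '__len__')

print(is_own_iterator, has_next, has_len)
```

An object that is its own iterator and has no length is a strong hint that it can be walked only once.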
You're running into a very common issue with iterators: they can only be iterated through once. As a rule, itertools functions return such one-shot iterators.
From the docs for groupby:
The returned group is itself an iterator that shares the underlying iterable with groupby(). Because the source is shared, when the groupby() object is advanced, the previous group is no longer visible.
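The behaviour described in that quote can be sketched as follows, assuming the question's data after sorting. The names grouped, stale and current are illustrative only.

```python
from itertools import groupby
from operator import itemgetter

data = [{'pid': 1, 'items': 1}, {'pid': 1, 'items': 3}, {'pid': 2, 'items': 5}]

grouped = groupby(data, key=itemgetter('pid'))
first_pid, first_rows = next(grouped)    # group for pid 1, not consumed yet
second_pid, second_rows = next(grouped)  # advancing groupby moves past pid 1

stale = list(first_rows)     # the earlier group no longer yields anything
current = list(second_rows)  # the current group is still readable

print(stale, current)
```

This is a second reason to copy a group into a list as soon as you receive it: the group becomes empty not only after you consume it, but also once groupby() moves on to the next key.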
Simply remove one of your print() calls, and watch it work. If you need to access the returned data multiple times, save each group in a list first.
Fixed code:
from itertools import groupby
from operator import itemgetter
data = [{'pid': 1, 'items': 1}, {'pid': 2, 'items': 5}, {'pid': 1, 'items': 3}]
data = sorted(data, key=itemgetter('pid'))
for pid, rows_gen in groupby(data, lambda x: x['pid']):
    rows = list(rows_gen)  # save the group to access more than once
    print(pid, sum(r['items'] for r in rows))
    for key in ['items']:
        print(pid, sum(r[key] for r in rows))