
Baffled by itertools groupby summation

Consider this:

from itertools import groupby
from operator import itemgetter

data = [{'pid': 1, 'items': 1}, {'pid': 2, 'items': 5}, {'pid': 1, 'items': 3}]
data = sorted(data, key=itemgetter('pid'))

for pid, rows in groupby(data, lambda x: x['pid']):
    print(pid, sum(r['items'] for r in rows))
    for key in ['items']:
        print(pid, sum(r[key] for r in rows))

The first print() call prints the right number: 4 for pid 1, 5 for pid 2. The second print() call, in the loop through the key list, prints 0 for both. What's going on?

The rows object you get from groupby is a single-use iterator (it behaves like a generator) that can only be consumed once. Iterating through it for your first print() call consumes the values, so rows is already exhausted when you try to iterate over it the next time, and the sum over an exhausted iterator is 0.
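
A minimal sketch of that exhaustion, using a plain iterator over a made-up list instead of groupby (the names values and it are illustrative, not from the question):

```python
values = [1, 3]
it = iter(values)        # a single-use iterator, like a groupby group

first_total = sum(it)    # consumes every element
second_total = sum(it)   # nothing is left, so the sum of no elements is 0

print(first_total, second_total)  # 4 0
```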

You could use row_list = list(rows) and then use row_list if you want the items to persist across multiple iteration passes.

For greater clarity, I suggest putting your code into the Python REPL, inspecting type(rows) in that loop, and looking at what API that object provides.
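
One way that inspection might look, as a sketch: the variable names grouped, is_iterator, is_own_iterator and has_len are made up for illustration, and the concrete class printed by type(rows) is a CPython implementation detail.

```python
from collections.abc import Iterator
from itertools import groupby
from operator import itemgetter

data = [{'pid': 1, 'items': 1}, {'pid': 2, 'items': 5}, {'pid': 1, 'items': 3}]
data = sorted(data, key=itemgetter('pid'))

grouped = groupby(data, key=itemgetter('pid'))
pid, rows = next(grouped)        # take only the first group

print(type(rows))                # concrete class is an implementation detail

is_iterator = isinstance(rows, Iterator)   # it has __iter__ and __next__
is_own_iterator = iter(rows) is rows       # iter() gives back the same object
has_len = hasattr(rows, '__len__')         # no length: it is not a list

print(is_iterator, is_own_iterator, has_len)  # True True False
```

Because iter(rows) returns rows itself, a second for loop or sum() cannot start over; it continues from wherever the first pass stopped.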

You're running into a very common issue with generators and other iterators: they can only be iterated through once. The functions in itertools return iterators as a rule.

From the docs for groupby:

The returned group is itself an iterator that shares the underlying iterable with groupby(). Because the source is shared, when the groupby() object is advanced, the previous group is no longer visible.
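
The second sentence of that quote describes a related trap. A small sketch, reusing the question's data (the names grouped, rows1, rows2, stale and fresh are illustrative):

```python
from itertools import groupby
from operator import itemgetter

data = [{'pid': 1, 'items': 1}, {'pid': 2, 'items': 5}, {'pid': 1, 'items': 3}]
data = sorted(data, key=itemgetter('pid'))

grouped = groupby(data, key=itemgetter('pid'))

pid1, rows1 = next(grouped)   # group for pid 1, not consumed yet
pid2, rows2 = next(grouped)   # advancing groupby skips past the pid 1 rows

stale = list(rows1)           # the earlier group no longer yields anything
fresh = list(rows2)           # the current group is still readable

print(pid1, stale)            # 1 []
print(pid2, fresh)            # 2 [{'pid': 2, 'items': 5}]
```

So storing the group iterators themselves for later use does not work; each group has to be read (or copied into a list) before asking groupby for the next one.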

Simply remove one of your print() calls, and watch it work. If you need to access the returned data multiple times, a list is a potential structure to save your results in.

Fixed code:

from itertools import groupby
from operator import itemgetter

data = [{'pid': 1, 'items': 1}, {'pid': 2, 'items': 5}, {'pid': 1, 'items': 3}]
data = sorted(data, key=itemgetter('pid'))

for pid, rows_gen in groupby(data, lambda x: x['pid']):
    rows = list(rows_gen)    # save the group so it can be iterated more than once
    print(pid, sum(r['items'] for r in rows))
    for key in ['items']:
        print(pid, sum(r[key] for r in rows))
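
If the groups are needed after the loop as well, one possible variation is to copy every group into a dict of lists up front. This is a sketch rather than part of the original answer; the names groups, totals and counts are made up here.

```python
from itertools import groupby
from operator import itemgetter

data = [{'pid': 1, 'items': 1}, {'pid': 2, 'items': 5}, {'pid': 1, 'items': 3}]
data = sorted(data, key=itemgetter('pid'))

# Copy each group into a list while groupby is still positioned on it.
groups = {pid: list(rows) for pid, rows in groupby(data, key=itemgetter('pid'))}

# Lists can be traversed any number of times.
totals = {pid: sum(r['items'] for r in rows) for pid, rows in groups.items()}
counts = {pid: len(rows) for pid, rows in groups.items()}

print(totals)  # {1: 4, 2: 5}
print(counts)  # {1: 2, 2: 1}
```

The trade-off is memory: every row is held in a list at once, which is fine for small inputs like this one but gives up the lazy, one-group-at-a-time behaviour of groupby.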
