Baffled by itertools groupby summation

Consider this...

from itertools import groupby
from operator import itemgetter

data = [{'pid': 1, 'items': 1}, {'pid': 2, 'items': 5}, {'pid': 1, 'items': 3}]
data = sorted(data, key=itemgetter('pid'))

for pid, rows in groupby(data, lambda x: x['pid']):
    print(pid, sum(r['items'] for r in rows))
    for key in ['items']:
        print(pid, sum(r[key] for r in rows))

The first print() call prints the right totals: 4 for pid 1 and 5 for pid 2. The second print() call, inside the loop over the key list, prints 0 for both. What's going on?

The rows object you get from groupby is an iterator (not a list, and strictly speaking not a generator either) that can only be consumed once. Iterating through it for your first print() call consumes its values, so rows is already exhausted when you try to iterate over it again, and sum() over an exhausted iterator returns 0.
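A minimal sketch of that exhaustion, using a cut-down version of the data from the question (the names first_pass and second_pass are only illustrative):

```python
from itertools import groupby

data = [{'pid': 1, 'items': 1}, {'pid': 1, 'items': 3}]

for pid, rows in groupby(data, key=lambda x: x['pid']):
    first_pass = sum(r['items'] for r in rows)   # consumes every row in the group
    second_pass = sum(r['items'] for r in rows)  # rows is exhausted, so this sums nothing
    print(pid, first_pass, second_pass)          # prints: 1 4 0
```

The second sum() does not raise an error; it simply sees no elements, which is why the bug shows up as a silent 0.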

If you need several passes over the items, store them first with row_list = list(rows) and iterate over row_list instead.

For greater clarity, I suggest putting your code into the Python REPL and inspecting type(rows) in that loop, and looking at what API that object provides.
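Something along these lines can be tried in the REPL. This is only a sketch: the exact type name printed is a CPython implementation detail and may differ elsewhere, so the checks below rely on the iterator protocol rather than on the name.

```python
from itertools import groupby

groups = groupby([1, 1, 2])
key, rows = next(groups)                  # first group: key 1

print(type(rows))                         # an internal itertools "grouper" type
is_own_iterator = iter(rows) is rows      # iterators return themselves from iter()
has_next = hasattr(rows, '__next__')      # supports next(rows)
has_len = hasattr(rows, '__len__')        # no len(), unlike a list
print(is_own_iterator, has_next, has_len)

first_group = list(rows)                  # copy the values out while they are available
print(first_group)
```

An object that is its own iterator and has no length or indexing is a strong hint that it supports a single pass only.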

You're running into a very common issue with iterators: they can only be iterated through once. The functions in itertools return iterators as a rule.

From the docs for groupby:

The returned group is itself an iterator that shares the underlying iterable with groupby(). Because the source is shared, when the groupby() object is advanced, the previous group is no longer visible.
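The quoted behavior can be seen with a small experiment. This is a sketch with made-up data; the names stale and fresh_contents are illustrative only.

```python
from itertools import groupby

data = [1, 1, 2, 2, 3]

# Collecting the (key, group) pairs first advances groupby past every group,
# so the earlier group iterators no longer have anything to yield.
stale = [(key, group) for key, group in groupby(data)]
stale_contents = [(key, list(group)) for key, group in stale]

# Copying each group into a list as soon as it is produced keeps its contents.
fresh_contents = [(key, list(group)) for key, group in groupby(data)]

print(stale_contents[0])   # (1, [])
print(fresh_contents)      # [(1, [1, 1]), (2, [2, 2]), (3, [3])]
```

This is the same sharing that makes a second pass over rows come back empty: each group is a view onto the one underlying iterator, not a stored copy.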

If you remove either of the two print() calls, the remaining one prints the correct sums. If you need to access a group's data more than once, save it in a list first.

Fixed code:

from itertools import groupby
from operator import itemgetter

data = [{'pid': 1, 'items': 1}, {'pid': 2, 'items': 5}, {'pid': 1, 'items': 3}]
data = sorted(data, key=itemgetter('pid'))

for pid, rows_gen in groupby(data, lambda x: x['pid']):
    rows = list(rows_gen)  # save the group so it can be iterated more than once
    print(pid, sum(r['items'] for r in rows))
    for key in ['items']:
        print(pid, sum(r[key] for r in rows))
