
itertools.groupby function seems inconsistent

I'm having trouble understanding exactly what this function does, I guess because of the programming magic around its use.

It seems to me that it returns a list of keys (the unique letters in a string), each paired with an iterator over the occurrences of that letter in the original string, but sometimes this does not seem to be the case.

For example:

import itertools

x = list(itertools.groupby("AAABBB"))
print x

which prints:

[('A', <itertools._grouper object at 0x101a0b050>),
 ('B', <itertools._grouper object at 0x101a0b090>)]

This seems correct: we have our unique keys paired with iterators. But when I run:

print list(x[0][1])

I get:

[]

and when I run

for k, g in x:
    print k + ' - ' + str(g)

I get:

B - <itertools._grouper object at 0x1007eedd5>

It ignores the first element. This seems counter-intuitive, because if I just change the syntax a little bit:

[list(g) for k, g in itertools.groupby("AAABBB")]

I get:

[['A', 'A', 'A'], ['B', 'B', 'B']]

which is right, and aligns with what I think this function should be doing.

However, if I once again change the syntax just a bit:

[list(thing) for thing in [g for k, g in itertools.groupby(string)]]

I get back:

[[], ['B']]

These two list comprehensions should be directly equivalent, but they return different results.

What is going on? Insight would be extremely appreciated.

The docs already explain why your listcomps aren't equivalent:

The returned group is itself an iterator that shares the underlying iterable with groupby(). Because the source is shared, when the groupby() object is advanced, the previous group is no longer visible. So, if that data is needed later, it should be stored as a list

Your

[list(g) for k, g in itertools.groupby("AAABBB")]

does use each group before groupby() advances, so it works.

Your

[list(thing) for thing in [g for k, g in itertools.groupby(string)]]

doesn't use any group until after all groups have been generated. Not at all the same, and for the reason the quoted docs explained.

To get the answers you expect, convert each returned group iterator to a list before groupby() advances to the next group.
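As a minimal sketch of that advice (written for Python 3; the name `pairs` is only illustrative), each group can be turned into a list inside the same loop that yields it, which also keeps every key next to its items:

```python
from itertools import groupby

# Materialize each group while it is still the current one,
# i.e. before groupby() is asked for the next (key, group) pair.
pairs = [(key, list(group)) for key, group in groupby("AAABBB")]

print(pairs)  # [('A', ['A', 'A', 'A']), ('B', ['B', 'B', 'B'])]
```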

groupby() consumes its input iterator lazily (that is, it reads data only as needed). To find a new group, it needs to read up to the next non-equal element (the first member of the next group). If you call list() on the subgroup iterator, it will advance the input to the end of the current group.

In general, if you advance to the next group, then the previously returned subgroup iterator won't have any data and will appear empty. So, if you need the data in the subgroup iterator, you need to list it before advancing to the next group.

The reason for this behavior is that iterators are all about looking at one piece of data at a time and not keeping anything unnecessary in memory.

Here's some code that makes all the operations visible:
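A small sketch that steps through the groups by hand with next() may make this concrete. It assumes Python 3, and the variable names are only illustrative:

```python
from itertools import groupby

groups = groupby("AAABBB")

key_a, group_a = next(groups)   # the 'A' group is now the current group
first_item = next(group_a)      # reads a single 'A' from the shared input

key_b, group_b = next(groups)   # advancing skips over the rest of the 'A' run
leftover_a = list(group_a)      # the old group has nothing left to give
items_b = list(group_b)         # the current group can still be read in full

print(key_a, first_item, leftover_a)  # A A []
print(key_b, items_b)                 # B ['B', 'B', 'B']
```

Once `next(groups)` has moved on to the 'B' group, the old 'A' iterator yields nothing more, even though only one of its three items had been read.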

from itertools import groupby

def supply():
    'Make the lazy input visible'
    for c in 'aaaaabbbcdddddddeeee':
        print('supplying %r' % c)
        yield c

print("\nCase where we don't consume the sub-iterator")
for k, g in groupby(supply()):
    print('Got group for %r' % k)

print("\nCase where we do consume the sub-iterator before advancing")
for k, g in groupby(supply()):
    print('Got group for %r' % k)
    print(list(g))

In the example "that is driving you crazy", the list operation is being applied too late (in the outer list comprehension). The solution is to move the list step to the inner comprehension:

>>> import itertools
>>> [list(g) for k, g in itertools.groupby('aaaaabbbb')]
[['a', 'a', 'a', 'a', 'a'], ['b', 'b', 'b', 'b']]

If you don't really care about conserving memory, then running grouped = [list(g) for k, g in itertools.groupby(data)] is a perfectly reasonable way to go. Then you can look up data in any of the sublists whenever you want and not be subject to rules about when the iterator is consumed. In general, lists of lists are easier to work with than iterators. Hope this helps :-)
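A short sketch of that approach, assuming Python 3 and a made-up input string. Note that groupby() groups consecutive runs rather than all equal items, so a key can show up more than once:

```python
from itertools import groupby

data = "AAABBBAA"  # made-up sample input

# Store every group as a list while it is still the current group.
grouped = [list(g) for k, g in groupby(data)]

# Plain lists can be indexed, measured, and re-read in any order.
first_run = grouped[0]
run_lengths = [(run[0], len(run)) for run in grouped]

print(grouped)      # [['A', 'A', 'A'], ['B', 'B', 'B'], ['A', 'A']]
print(first_run)    # ['A', 'A', 'A']
print(run_lengths)  # [('A', 3), ('B', 3), ('A', 2)]
```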
