Using zip on the results of itertools.groupby unexpectedly gives empty lists

I've encountered some unexpected empty lists when using zip to transpose the results of itertools.groupby. In reality my data is a bunch of objects, but for simplicity let's say my starting data is this list:

a = [1, 1, 1, 2, 1, 3, 3, 2, 1]

I want to group the duplicates, so I use itertools.groupby (sorting first, because otherwise groupby only groups consecutive duplicates):

from itertools import groupby
duplicates = groupby(sorted(a))

This gives an itertools.groupby object, which when converted to a list gives:

[(1, <itertools._grouper object at 0x7fb3fdd86850>), (2, <itertools._grouper object at 0x7fb3fdd91700>), (3, <itertools._grouper object at 0x7fb3fdce7430>)]
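(To show why the sort matters: without it, groupby starts a new group at every change of value, so non-adjacent duplicates land in separate groups:)

>>> [(k, list(g)) for k, g in groupby(a)]
[(1, [1, 1, 1]), (2, [2]), (1, [1]), (3, [3, 3]), (2, [2]), (1, [1])]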

So far, so good. But now I want to transpose the results so I have a list of the unique values, [1, 2, 3], and a list of the items in each duplicate group, [<itertools._grouper object...>, ...]. For this I used the solution in this answer on using zip to "unzip":

>>> keys, values = zip(*duplicates)
>>> print(keys)
(1, 2, 3)
>>> print(values)
(<itertools._grouper object at 0x7fb3fdd37940>, <itertools._grouper object at 0x7fb3fddfb040>, <itertools._grouper object at 0x7fb3fddfb250>)

But when I try to read the itertools._grouper objects, I get a bunch of empty lists:

>>> for value in values:
...    print(list(value))
...
[]
[]
[]

What's going on? Shouldn't each value contain the duplicates in the original list, i.e. (1, 1, 1, 1, 1), (2, 2) and (3, 3)?

To group by each unique key for duplicate processing:

import itertools

a = [1, 1, 1, 2, 1, 3, 3, 2, 1]
g1 = itertools.groupby(sorted(a))
for k, v in g1:
    print(f"Key {k} has", end=" ")
    for e in v:
        print(e, end=" ")
    print()
# Key 1 has 1 1 1 1 1 
# Key 2 has 2 2 
# Key 3 has 3 3 

If it's just for counting how many of each, no sorting is needed at all:

import itertools
import collections

a = [1, 1, 1, 2, 1, 3, 3, 2, 1]
g1 = itertools.groupby(a)
c1 = collections.Counter()
for k, v in g1:
    c1[k] += len(tuple(v))  # materialize each run to count its length
for k, v in c1.items():
    print(f"Element {k} repeated {v} times")
# Element 1 repeated 5 times
# Element 2 repeated 2 times
# Element 3 repeated 2 times
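As a side note: since a Counter is involved anyway, collections.Counter(a) on its own tallies the whole list in a single pass, with no groupby needed:

c2 = collections.Counter(a)
print(c2)
# Counter({1: 5, 2: 2, 3: 2})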

Ah. The beauty of multiple iterators all using the same underlying object.

The documentation of groupby addresses this very issue:

The returned group is itself an iterator that shares the underlying iterable with groupby(). Because the source is shared, when the groupby() object is advanced, the previous group is no longer visible. So, if that data is needed later, it should be stored as a list:

groups = []
uniquekeys = []
data = sorted(data, key=keyfunc)
for k, g in groupby(data, keyfunc):
    groups.append(list(g))  # Store group iterator as a list
    uniquekeys.append(k)

So what ends up happening is that all your itertools._grouper objects are consumed before you ever unpack them. You see a similar effect if you try reusing any other iterator more than once. If you want to understand this better, look at the next paragraph in the docs, which shows how the internals of groupby actually work.
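A minimal illustration of that reuse effect with an ordinary list iterator:

>>> it = iter([1, 2, 3])
>>> list(it)
[1, 2, 3]
>>> list(it)  # already exhausted
[]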

Part of what helped me understand this was working through examples with a more obviously non-reusable iterator, like a file object. It helps to dissociate from the idea of an underlying buffer you can just keep track of.
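For instance (a sketch; it assumes some file example.txt exists):

with open("example.txt") as f:
    first_read = list(f)   # all the lines
    second_read = list(f)  # [] -- the file position is already at EOF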

A simple fix is to consume the groups yourself, as the documentation recommends:

# This is an iterator over a list:
duplicates = groupby(sorted(a))

# If you convert duplicates to a list, you consume it

# Don't store _grouper objects: consume them yourself:
keys, values = zip(*((key, list(value)) for key, value in duplicates))
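Because each list(value) is built while the generator is consumed, before groupby advances past that group, the transpose now works:

print(keys)    # (1, 2, 3)
print(values)  # ([1, 1, 1, 1, 1], [2, 2], [3, 3])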

As the other answer suggests, you don't need an O(N log N) solution that involves sorting, since you can do this in O(N) time in a single pass. Rather than use a Counter, though, I'd recommend a defaultdict to help store the lists:

from collections import defaultdict

result = defaultdict(list)
for item in a:
    result[item].append(item)
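The unique keys and grouped values then fall out directly; dicts preserve insertion order in Python 3.7+, so the keys appear in order of first occurrence:

print(list(result.keys()))    # [1, 2, 3]
print(list(result.values()))  # [[1, 1, 1, 1, 1], [2, 2], [3, 3]]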

For more complex objects, you'd index with key(item) instead of item.
