How do I use itertools.groupby()?

I haven't been able to find an understandable explanation of how to actually use Python's itertools.groupby() function. What I'm trying to do is this:

  • Take a list - in this case, the children of an objectified lxml element
  • Divide it into groups based on some criteria
  • Then later iterate over each of these groups separately.

I've reviewed the documentation, but I've had trouble applying it beyond a simple list of numbers.

So, how do I use itertools.groupby()? Is there another technique I should be using? Pointers to good "prerequisite" reading would also be appreciated.

IMPORTANT NOTE: You have to sort your data first.


The part I didn't get is that in the example construction

groups = []
uniquekeys = []
for k, g in groupby(data, keyfunc):
   groups.append(list(g))    # Store group iterator as a list
   uniquekeys.append(k)

k is the current grouping key, and g is an iterator that you can use to iterate over the group defined by that grouping key. In other words, the groupby iterator itself returns iterators.

Here's an example of that, using clearer variable names:

from itertools import groupby

things = [("animal", "bear"), ("animal", "duck"), ("plant", "cactus"), ("vehicle", "speed boat"), ("vehicle", "school bus")]

for key, group in groupby(things, lambda x: x[0]):
    for thing in group:
        print("A %s is a %s." % (thing[1], key))
    print("")
    

This will give you the output:

A bear is a animal.
A duck is a animal.

A cactus is a plant.

A speed boat is a vehicle.
A school bus is a vehicle.

In this example, things is a list of tuples where the first item in each tuple is the group the second item belongs to.

The groupby() function takes two arguments: (1) the data to group and (2) the function to group it with.

Here, lambda x: x[0] tells groupby() to use the first item in each tuple as the grouping key.

In the above for statement, groupby returns three (key, group iterator) pairs, one for each unique key. You can use the returned iterator to iterate over each individual item in that group.

Here's a slightly different example with the same data, using a list comprehension:

for key, group in groupby(things, lambda x: x[0]):
    listOfThings = " and ".join([thing[1] for thing in group])
    print(key + "s: " + listOfThings + ".")

This will give you the output:

animals: bear and duck.
plants: cactus.
vehicles: speed boat and school bus.
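
The note about sorting first can be sketched as follows. This is a minimal example, assuming the same kind of data as above but in shuffled order; the names shuffled and category are hypothetical and not part of the original answer. Sorting and grouping use the same key function, so each key should come out exactly once.

```python
from itertools import groupby

# Same kind of data as above, deliberately shuffled so equal keys are not adjacent.
shuffled = [("vehicle", "speed boat"), ("animal", "bear"), ("plant", "cactus"),
            ("animal", "duck"), ("vehicle", "school bus")]

def category(pair):
    """Key function (hypothetical name): the first item of each tuple."""
    return pair[0]

# Sort with the same key used for grouping so equal keys become consecutive.
grouped = {key: [name for _, name in group]
           for key, group in groupby(sorted(shuffled, key=category), key=category)}

print(grouped)
```

Because sorted() is stable, items within each group keep their original relative order.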

itertools.groupby is a tool for grouping items.

From the docs, we glean further what it might do:

# [k for k, g in groupby('AAAABBBCCDAABBB')] --> ABCDAB

# [list(g) for k, g in groupby('AAAABBBCCD')] --> AAAA BBB CC D

groupby objects yield key-group pairs where the group is an iterator.

Features

  • A. Group consecutive items together
  • B. Group all occurrences of an item, given a sorted iterable
  • C. Specify how to group items with a key function *

Comparisons

# Define a printer for comparing outputs
>>> import itertools as it
>>> def print_groupby(iterable, keyfunc=None):
...    for k, g in it.groupby(iterable, keyfunc):
...        print("key: '{}'--> group: {}".format(k, list(g)))
# Feature A: group consecutive occurrences
>>> print_groupby("BCAACACAADBBB")
key: 'B'--> group: ['B']
key: 'C'--> group: ['C']
key: 'A'--> group: ['A', 'A']
key: 'C'--> group: ['C']
key: 'A'--> group: ['A']
key: 'C'--> group: ['C']
key: 'A'--> group: ['A', 'A']
key: 'D'--> group: ['D']
key: 'B'--> group: ['B', 'B', 'B']

# Feature B: group all occurrences
>>> print_groupby(sorted("BCAACACAADBBB"))
key: 'A'--> group: ['A', 'A', 'A', 'A', 'A']
key: 'B'--> group: ['B', 'B', 'B', 'B']
key: 'C'--> group: ['C', 'C', 'C']
key: 'D'--> group: ['D']

# Feature C: group by a key function
>>> # islower = lambda s: s.islower()                      # equivalent
>>> def islower(s):
...     """Return True if a string is lowercase, else False."""   
...     return s.islower()
>>> print_groupby(sorted("bCAaCacAADBbB"), keyfunc=islower)
key: 'False'--> group: ['A', 'A', 'A', 'B', 'B', 'C', 'C', 'D']
key: 'True'--> group: ['a', 'a', 'b', 'b', 'c']

Uses

Note: Several of the latter examples derive from Víctor Terrón's PyCon (talk) (Spanish), "Kung Fu at Dawn with Itertools". See also the groupby source code written in C.

* A function where all items are passed through and compared, influencing the result. Other objects with key functions include sorted(), max() and min().


Response

# OP: Yes, you can use `groupby`, e.g. 
[do_something(list(g)) for _, g in groupby(lxml_elements, criteria_func)]
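
To make the one-liner above concrete, here is a minimal sketch of grouping the children of an XML element. It assumes the standard library's xml.etree.ElementTree as a stand-in for lxml, and the document, the tag names, and the by_tag helper are all made up for illustration.

```python
from itertools import groupby
import xml.etree.ElementTree as ET  # stdlib stand-in for lxml (assumption)

# Hypothetical document; the tag names are invented for this example.
root = ET.fromstring(
    "<root><title>a</title><para>b</para><para>c</para>"
    "<title>d</title><para>e</para></root>"
)

def by_tag(element):
    """Grouping criterion (hypothetical): the element's tag name."""
    return element.tag

# Iterating an element yields its children; consecutive children
# with the same tag end up in one group.
runs = [(tag, [child.text for child in children])
        for tag, children in groupby(root, key=by_tag)]
print(runs)
```

To collect every child with a given tag regardless of position, sort by the same key first: groupby(sorted(root, key=by_tag), key=by_tag).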

The example in the Python docs is quite straightforward:

groups = []
uniquekeys = []
for k, g in groupby(data, keyfunc):
    groups.append(list(g))      # Store group iterator as a list
    uniquekeys.append(k)

So in your case, data is a list of nodes, keyfunc is where the logic of your criteria function goes, and then groupby() groups the data.

You must sort the data by the same criteria before you call groupby, or equal keys that are not adjacent will end up in separate groups. groupby simply iterates through the data and starts a new group whenever the key changes.

A neat trick with groupby is run-length encoding in one line:

[(c,len(list(cgen))) for c,cgen in groupby(some_string)]

This will give you a list of 2-tuples where the first element is the character and the second is the number of consecutive repetitions.

Edit: Note that this is what separates itertools.groupby from SQL GROUP BY semantics: itertools doesn't (and in general can't) sort the iterator in advance, so groups with the same "key" aren't merged.
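
A minimal sketch of the run-length encoding idea, with a decoder added to show that the encoding is reversible. The function names rle_encode and rle_decode are hypothetical; note how the character 'a' appears in two separate runs because the input is not sorted.

```python
from itertools import groupby

def rle_encode(text):
    """Return (character, run length) pairs for consecutive runs in text."""
    return [(char, sum(1 for _ in run)) for char, run in groupby(text)]

def rle_decode(pairs):
    """Rebuild the original string from (character, run length) pairs."""
    return "".join(char * count for char, count in pairs)

encoded = rle_encode("aaabccaa")
print(encoded)              # [('a', 3), ('b', 1), ('c', 2), ('a', 2)]
print(rle_decode(encoded))  # aaabccaa
```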

Another example:

for key, igroup in itertools.groupby(range(12), lambda x: x // 5):
    print(key, list(igroup))

results in

0 [0, 1, 2, 3, 4]
1 [5, 6, 7, 8, 9]
2 [10, 11]

Note that igroup is an iterator (a sub-iterator as the documentation calls it).

This is useful for chunking a generator:

def chunker(items, chunk_size):
    '''Group items in chunks of chunk_size'''
    for _key, group in itertools.groupby(enumerate(items), lambda x: x[0] // chunk_size):
        yield (g[1] for g in group)

with open('file.txt') as fobj:
    for chunk in chunker(fobj, 100):  # 100 lines per chunk; an arbitrary example value
        process(chunk)
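
One caveat with the chunker above is that each yielded chunk is a lazy generator, so it has to be consumed before the next chunk is requested. A minimal variant that yields lists instead, so the chunks can be stored and revisited, might look like this (chunker_lists is a hypothetical name):

```python
import itertools

def chunker_lists(items, chunk_size):
    """Yield lists of up to chunk_size consecutive items."""
    # The key is the item's index divided by chunk_size, so it changes
    # every chunk_size items and groupby starts a new group there.
    for _key, group in itertools.groupby(enumerate(items),
                                         lambda pair: pair[0] // chunk_size):
        yield [item for _index, item in group]

chunks = list(chunker_lists(range(7), 3))
print(chunks)  # [[0, 1, 2], [3, 4, 5], [6]]
```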

Another example of groupby, this time with keys that are not sorted. In the following example, items in xx are grouped by values in yy. In this case, one set of zeros is output first, followed by a set of ones, followed again by a set of zeros.

xx = range(10)
yy = [0, 0, 0, 1, 1, 1, 0, 0, 0, 0]
for key, group in itertools.groupby(xx, lambda x: yy[x]):
    print(key, list(group))

Produces:

0 [0, 1, 2]
1 [3, 4, 5]
0 [6, 7, 8, 9]

WARNING:

The syntax list(groupby(...)) won't work the way that you intend. The group iterators share the underlying iterable with the groupby object, so a group can no longer be read once groupby has advanced past it. Using

for x in list(groupby(range(10))):
    print(list(x[1]))

will produce (the last line may differ depending on the Python version):

[]
[]
[]
[]
[]
[]
[]
[]
[]
[9]

Instead of list(groupby(...)), try [(k, list(g)) for k, g in groupby(...)], or if you use that syntax often,

def groupbylist(*args, **kwargs):
    return [(k, list(g)) for k, g in groupby(*args, **kwargs)]

and get access to the groupby functionality while avoiding those pesky (for small data) iterators altogether.
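
The pitfall can be reproduced on a small scale. This is a minimal sketch with made-up variable names: a group iterator is stored, the groupby object is advanced, and the stored group turns out to be empty.

```python
from itertools import groupby

grouper = groupby("aabbbc")
first_key, first_group = next(grouper)
second_key, second_group = next(grouper)  # advancing invalidates first_group

stale = list(first_group)    # nothing left to read from the earlier group
fresh = list(second_group)   # the current group is still readable

print(first_key, stale)   # a []
print(second_key, fresh)  # b ['b', 'b', 'b']
```

Materializing each group with list(g) inside the loop, as groupbylist does, avoids the problem.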

I would like to give another example where groupby without sorting does not give the expected result. Adapted from the example by James Sulak:

from itertools import groupby

things = [("vehicle", "bear"), ("animal", "duck"), ("animal", "cactus"), ("vehicle", "speed boat"), ("vehicle", "school bus")]

for key, group in groupby(things, lambda x: x[0]):
    for thing in group:
        print("A %s is a %s." % (thing[1], key))
    print(" ")

The output is:

A bear is a vehicle.

A duck is a animal.
A cactus is a animal.

A speed boat is a vehicle.
A school bus is a vehicle.

There are two groups with the key vehicle, whereas one might expect only one.

@CaptSolo, I tried your example, but it didn't work.

from itertools import groupby 
[(c,len(list(cs))) for c,cs in groupby('Pedro Manoel')]

Output:

[('P', 1), ('e', 1), ('d', 1), ('r', 1), ('o', 1), (' ', 1), ('M', 1), ('a', 1), ('n', 1), ('o', 1), ('e', 1), ('l', 1)]

As you can see, there are two o's and two e's, but they ended up in separate groups. That's when I realized you need to sort the list passed to the groupby function. So, the correct usage would be:

name = list('Pedro Manoel')
name.sort()
[(c,len(list(cs))) for c,cs in groupby(name)]

Output:

[(' ', 1), ('M', 1), ('P', 1), ('a', 1), ('d', 1), ('e', 2), ('l', 1), ('n', 1), ('o', 2), ('r', 1)]

Just remember: if the list is not sorted, groupby will not merge equal items that are not adjacent!

Sorting and groupby

from itertools import groupby

val = [{'name': 'satyajit', 'address': 'btm', 'pin': 560076}, 
       {'name': 'Mukul', 'address': 'Silk board', 'pin': 560078},
       {'name': 'Preetam', 'address': 'btm', 'pin': 560076}]


for pin, list_data in groupby(sorted(val, key=lambda k: k['pin']), lambda x: x['pin']):
    print(pin)
    for rec in list_data:
        print(rec)

Output:

560076
{'name': 'satyajit', 'address': 'btm', 'pin': 560076}
{'name': 'Preetam', 'address': 'btm', 'pin': 560076}
560078
{'name': 'Mukul', 'address': 'Silk board', 'pin': 560078}
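
Since the same key has to be passed to both sorted() and groupby(), it can help to define it once. A minimal sketch using operator.itemgetter on the same records; the names records, by_pin, and names_by_pin are chosen here for illustration only.

```python
from itertools import groupby
from operator import itemgetter

records = [{'name': 'satyajit', 'address': 'btm', 'pin': 560076},
           {'name': 'Mukul', 'address': 'Silk board', 'pin': 560078},
           {'name': 'Preetam', 'address': 'btm', 'pin': 560076}]

by_pin = itemgetter('pin')  # one key function shared by sorted() and groupby()

# Sort by pin, then collect the names that fall under each pin.
names_by_pin = {pin: [rec['name'] for rec in recs]
                for pin, recs in groupby(sorted(records, key=by_pin), key=by_pin)}

print(names_by_pin)  # {560076: ['satyajit', 'Preetam'], 560078: ['Mukul']}
```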

How do I use Python's itertools.groupby()?

You can use groupby to group things to iterate over. You give groupby an iterable and an optional key function/callable by which to check the items as they come out of the iterable, and it returns an iterator that yields two-tuples: the result of the key callable, and the matching items in another iterable. From the help:

groupby(iterable[, keyfunc]) -> create an iterator which returns
(key, sub-iterator) grouped by each value of key(value).

Here's an example of groupby using a coroutine to group by a count. It uses a key callable (in this case, coroutine.send) to emit the same count for n consecutive items, and groupby yields a sub-iterator of elements for each count:

import itertools


def grouper(iterable, n):
    def coroutine(n):
        yield # queue up coroutine
        for i in itertools.count():
            for j in range(n):
                yield i
    groups = coroutine(n)
    next(groups) # queue up coroutine

    for c, objs in itertools.groupby(iterable, groups.send):
        yield c, list(objs)
    # or instead of materializing a list of objs, just:
    # return itertools.groupby(iterable, groups.send)

list(grouper(range(10), 3))

returns

[(0, [0, 1, 2]), (1, [3, 4, 5]), (2, [6, 7, 8]), (3, [9])]

Sadly I don't think it's advisable to use itertools.groupby(). It's just too hard to use safely, and it's only a handful of lines to write something that works as expected.

from collections import defaultdict

def my_group_by(iterable, keyfunc):
    """Because itertools.groupby is tricky to use

    The stdlib method requires sorting in advance, and returns iterators not
    lists, and those iterators get consumed as you try to use them, throwing
    everything off if you try to look at something more than once.
    """
    ret = defaultdict(list)
    for k in iterable:
        ret[keyfunc(k)].append(k)
    return dict(ret)

Use it like this:

def first_letter(x):
    return x[0]

my_group_by('four score and seven years ago'.split(), first_letter)

to get

{'f': ['four'], 's': ['score', 'seven'], 'a': ['and', 'ago'], 'y': ['years']}

This basic implementation helped me understand this function. Hope it helps others as well:

from itertools import groupby

arr = [(1, "A"), (1, "B"), (1, "C"), (2, "D"), (2, "E"), (3, "F")]

for k, g in groupby(arr, lambda x: x[0]):
    print("--", k, "--")
    for tup in g:
        print(tup[1])  # tup[0] == k

Output:

-- 1 --
A
B
C
-- 2 --
D
E
-- 3 --
F

One useful example that I came across may be helpful:

from itertools import groupby

#user input

myinput = input()

#creating empty list to store output

myoutput = []

for k,g in groupby(myinput):

    myoutput.append((len(list(g)),int(k)))

print(*myoutput)

Sample input: 14445221

Sample output: (1, 1) (3, 4) (1, 5) (2, 2) (1, 1)

from random import randint
from itertools import groupby

l = [randint(1, 3) for _ in range(20)]

d = {}
for k, g in groupby(l, lambda x: x):
    if not d.get(k, None):
        d[k] = list(g)
    else:
        d[k] = d[k] + list(g)

The code above shows how groupby can be used to group a list based on the lambda function/key supplied. The only problem is that the output is not merged; this can be easily resolved using a dictionary.

Example:

l = [2, 1, 2, 3, 1, 3, 2, 1, 3, 3, 1, 3, 2, 3, 1, 2, 1, 3, 2, 3]

After applying groupby, the result will be:

for k, g in groupby(l, lambda x:x):
    print(k, list(g))

2 [2]
1 [1]
2 [2]
3 [3]
1 [1]
3 [3]
2 [2]
1 [1]
3 [3, 3]
1 [1]
3 [3]
2 [2]
3 [3]
1 [1]
2 [2]
1 [1]
3 [3]
2 [2]
3 [3]

Once a dictionary is used as shown above, the following result is derived, which can be easily iterated over:

{2: [2, 2, 2, 2, 2, 2], 1: [1, 1, 1, 1, 1, 1], 3: [3, 3, 3, 3, 3, 3, 3, 3]}

The key thing to recognize with itertools.groupby is that items are only grouped together as long as they're sequential in the iterable. This is why sorting works: you're rearranging the collection so that all of the items which give the same callback(item) result appear next to each other.

That being said, you don't need to sort the list; you just need a collection of key-value pairs, where each value can grow with every group iterable yielded by groupby, i.e. a dict of lists.

>>> import itertools
>>> things = [("vehicle", "bear"), ("animal", "duck"), ("animal", "cactus"), ("vehicle", "speed boat"), ("vehicle", "school bus")]
>>> coll = {}
>>> for k, g in itertools.groupby(things, lambda x: x[0]):
...     coll.setdefault(k, []).extend(i for _, i in g)
...
>>> coll
{'vehicle': ['bear', 'speed boat', 'school bus'], 'animal': ['duck', 'cactus']}
