python groupby itertools list methods

I have a list like this, where each row is [YEAR, DAY, VALUE1, VALUE2, VALUE3]:

[[2014, 1, 10, 20, 30],
[2014, 1, 3, 7, 4],
[2014, 2, 14, 43, 5],
[2014, 2, 33, 1, 6],
...
[2013, 1, 34, 54, 3],
[2013, 2, 23, 33, 2],
...]

and I need to group by year and day, to obtain something like:

[[2014, 1, sum(all values1 with day=1), sum(all values2 with day=1), avg(all values3 with day=1)],
[2014, 2, sum(all values1 with day=2), sum(all values2 with day=2), avg(all values3 with day=2)],
....
[2013, 1, sum(all values1 with day=1), sum(all values2 with day=1), avg(all values3 with day=1)],
[2013, 2, sum(all values1 with day=2), sum(all values2 with day=2), avg(all values3 with day=2)],
....]

How can I do that with itertools? I can't use pandas or numpy because my system doesn't support them. Thanks a lot for your help.

import itertools
import operator

key = operator.itemgetter(0,1)
my_list.sort(key=key)
for (year, day), records in itertools.groupby(my_list, key):
    print("Records on", year, day, ":")
    for record in records: print(record)
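To go from printed groups to the totals the question asks for, each group can be materialised and reduced column by column. The following is a minimal sketch, assuming the six sample rows from the question; `aggregated`, `total1`, `total2` and `mean3` are illustrative names, not part of the original answer.

```python
import itertools
import operator

# Sample rows from the question: [year, day, value1, value2, value3]
my_list = [
    [2014, 1, 10, 20, 30],
    [2014, 1, 3, 7, 4],
    [2014, 2, 14, 43, 5],
    [2014, 2, 33, 1, 6],
    [2013, 1, 34, 54, 3],
    [2013, 2, 23, 33, 2],
]

key = operator.itemgetter(0, 1)
my_list.sort(key=key)  # groupby only merges adjacent rows, so sort first

aggregated = []
for (year, day), records in itertools.groupby(my_list, key):
    rows = list(records)  # a group iterator can only be read once
    total1 = sum(row[2] for row in rows)
    total2 = sum(row[3] for row in rows)
    mean3 = sum(row[4] for row in rows) / len(rows)
    aggregated.append([year, day, total1, total2, mean3])

for row in aggregated:
    print(row)
```

Because the list is sorted first, the 2013 rows come out before the 2014 rows. The division assumes Python 3, where `/` always returns a float.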

itertools.groupby doesn't work like SQL's GROUP BY. It groups in order. This means that if the list is not sorted by the grouping key, you may get multiple groups for the same key. So, let's say you want to group a list of integers by parity (even vs. odd); then you might do this:

L = [1,2,3,4,5,7,8]  # notice that there's no 6 in the list
itertools.groupby(L, lambda i:i%2)

Now, if you come from an SQL world, you might think that this gives you two groups: one for the even numbers and one for the odd numbers. While this makes sense, it is not how Python does things. It considers each element in turn and checks whether its key equals the key of the previous element. If so, the element is added to the current group; otherwise, a new group is started.

So with the above list, we get:

key: 1
elements: [1]

key: 0
elements: [2]

key: 1
elements: [3]

key: 0
elements: [4]

key: 1
elements: [5,7]  # see what happened here?

key: 0
elements: [8]
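The grouping listed above can be reproduced with a few lines. This is a small sketch; `groups` is an illustrative name, and each group is copied into a list because groupby hands out one-shot iterators.

```python
import itertools

L = [1, 2, 3, 4, 5, 7, 8]  # not sorted by parity; there is no 6

# Materialise every group so it can be printed and inspected later
groups = [(k, list(g)) for k, g in itertools.groupby(L, lambda i: i % 2)]

for k, elements in groups:
    print("key:", k, "elements:", elements)
```

The keys 1 and 0 each appear several times, which is exactly the difference from SQL's GROUP BY.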

So if you're looking to make a grouping like in SQL, you'll want to sort the list beforehand by the key (criterion) you want to group on:

L = [1,2,3,4,5,7,8]  # notice that there's no 6 in the list
L.sort(key=lambda i: i % 2)  # now L looks like this: [2,4,8,1,3,5,7] - the evens and the odds stick together
itertools.groupby(L, lambda i: i % 2)  # this gives two groups, one per parity, each containing all its elements
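To see the effect of sorting first, the groups can be collected into a dictionary. A minimal sketch; the helper `parity` and the name `grouped` are illustrative only.

```python
import itertools

def parity(i):
    """Grouping key: 0 for even numbers, 1 for odd numbers."""
    return i % 2

L = [1, 2, 3, 4, 5, 7, 8]
L.sort(key=parity)  # stable sort, so L becomes [2, 4, 8, 1, 3, 5, 7]

# After sorting, every key occurs in exactly one run, so a dict is safe
grouped = {k: list(g) for k, g in itertools.groupby(L, parity)}
print(grouped)
```

Building a dict this way is only safe on sorted input; on unsorted input a later run with the same key would overwrite an earlier one.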

I tried to write a short and concise answer and didn't succeed, but I did manage to get a lot of Python built-in modules involved:

import itertools
import operator
import functools

I'll use functools.reduce to do the sums, but it needs a custom function:

def sum_sum_sum_counter(res, array):
    # Unpack the values of the array
    year, day, val1, val2, val3 = array
    res[0] += val1
    res[1] += val2
    res[2] += val3
    res[3] += 1 # counter
    return res

This function carries a counter because you want to calculate the average; keeping a sum and a count is more intuitive than a running-mean implementation.

Now the fun part: I'll group by the first two elements (assuming rows with the same year and day are already adjacent; otherwise one would first need something like lst = sorted(lst, key=operator.itemgetter(0,1))):

result = []
for i, values in itertools.groupby(lst, operator.itemgetter(0,1)):
    # Now let's use the reduce function with a start list containing zeros
    calc = functools.reduce(sum_sum_sum_counter, values, [0, 0, 0, 0])
    # Append year, day and the results.
    result.append([i[0], i[1], calc[0], calc[1], calc[2]/calc[3]])

calc[2]/calc[3] is the average of value3. Remember that the last element of the list carried through reduce is a counter, and a sum divided by the count is the average.

This gives the result:

[[2014, 1, 13, 27, 17.0],
 [2014, 2, 47, 44, 5.5],
 [2013, 1, 34, 54, 3.0],
 [2013, 2, 23, 33, 2.0]]

using just the values you've given.

On real data, sorting before grouping might become inefficient:

  • first, sorting consumes the complete iterator, losing one important goal of functional programming: laziness
  • sorting is O(n log n), compared to O(n) for grouping

To group by some predicate in the SQL + Pythonic way, a simple reduce/accumulate using a collections.defaultdict will do:

from functools import reduce
from collections import defaultdict as DD

def groupby( pred, it ):
  return reduce( lambda d,x: d[ pred(x) ].append(x) or d, it, DD(list) )

Then use it with some predicate function or lambda:

>>> words = 'your code might become less readable using reduce'.split()
>>> groupby( len, words )[4]
['your', 'code', 'less']
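Applied to the rows from the question, the same dictionary idea needs no sort at all. This is a sketch under assumptions: `rows`, `totals` and `summary` are illustrative names, and a plain loop stands in for reduce to keep it readable.

```python
from collections import defaultdict

# Sample rows from the question: [year, day, value1, value2, value3]
rows = [
    [2014, 1, 10, 20, 30],
    [2014, 1, 3, 7, 4],
    [2014, 2, 14, 43, 5],
    [2014, 2, 33, 1, 6],
    [2013, 1, 34, 54, 3],
    [2013, 2, 23, 33, 2],
]

# (year, day) -> [sum of value1, sum of value2, sum of value3, row count]
totals = defaultdict(lambda: [0, 0, 0, 0])
for year, day, v1, v2, v3 in rows:
    acc = totals[(year, day)]
    acc[0] += v1
    acc[1] += v2
    acc[2] += v3
    acc[3] += 1

# Turn the accumulators into [year, day, sum1, sum2, average3]
summary = [
    [year, day, s1, s2, s3 / n]
    for (year, day), (s1, s2, s3, n) in totals.items()
]
print(summary)
```

This is a single O(n) pass and works on rows in any order. The output order follows dictionary insertion order, which is guaranteed only from Python 3.7 on; sort `summary` if a specific order matters.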

Concerning laziness, reduce won't return before consuming all input either, of course. You might use itertools.accumulate instead, always returning the same defaultdict, to consume the input (and process the changing groups) lazily and with a low memory footprint.
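One way the accumulate variant could look is sketched below. The function name `lazy_groupby` and the helper `step` are hypothetical, and the `initial=` keyword of itertools.accumulate assumes Python 3.8 or newer.

```python
from collections import defaultdict
from itertools import accumulate

def lazy_groupby(pred, it):
    """Yield the same, growing defaultdict once per consumed item."""
    def step(groups, x):
        groups[pred(x)].append(x)
        return groups  # always the same dict object, mutated in place
    # initial= (Python 3.8+) is yielded first, before any item is read
    return accumulate(it, step, initial=defaultdict(list))

words = 'your code might become less readable using reduce'.split()
snapshots = lazy_groupby(len, iter(words))

next(snapshots)            # the empty starting dict, nothing consumed yet
groups = next(snapshots)   # exactly one word has been consumed
print(dict(groups))

for groups in snapshots:   # drain the rest; groups keeps growing in place
    pass
print(groups[4])
```

Every yielded value is the same dictionary, so a snapshot taken early will also show later additions; copy it if a frozen view is needed.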
