
python groupby itertools list methods

I have a list like this: #[YEAR, DAY, VALUE1, VALUE2, VALUE3]

[[2014, 1, 10, 20, 30],
[2014, 1, 3, 7, 4],
[2014, 2, 14, 43, 5],
[2014, 2, 33, 1, 6]
...
[2013, 1, 34, 54, 3],
[2013, 2, 23, 33, 2],
...]

and I need to group by years and days, to obtain something like:

[[2014, 1, sum(all values1 with day=1), sum(all values2 with day=1), avg(all values3 with day=1)],
[2014, 2, sum(all values1 with day=2), sum(all values2 with day=2), avg(all values3 with day=2)],
....
[2013, 1, sum(all values1 with day=1), sum(all values2 with day=1), avg(all values3 with day=1)],
[2013, 2, sum(all values1 with day=2), sum(all values2 with day=2), avg(all values3 with day=2)],
....]

How can I do that with itertools? I can't use pandas or numpy because my system doesn't support them. Thanks a lot for your help.

import itertools
import operator

key = operator.itemgetter(0,1)
my_list.sort(key=key)
for (year, day), records in itertools.groupby(my_list, key):
    print("Records on", year, day, ":")
    for record in records: print(record)
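To go from printing the groups to the requested sums and average, a minimal sketch (using the sample rows from the question as `my_list`; each group iterator is materialized into a list so it can be traversed more than once):

```python
import itertools
import operator

my_list = [
    [2014, 1, 10, 20, 30],
    [2014, 1, 3, 7, 4],
    [2014, 2, 14, 43, 5],
    [2014, 2, 33, 1, 6],
    [2013, 1, 34, 54, 3],
    [2013, 2, 23, 33, 2],
]

key = operator.itemgetter(0, 1)
my_list.sort(key=key)  # groupby needs equal keys on consecutive rows

result = []
for (year, day), records in itertools.groupby(my_list, key):
    rows = list(records)  # materialize: the group iterator is single-use
    sum1 = sum(r[2] for r in rows)
    sum2 = sum(r[3] for r in rows)
    avg3 = sum(r[4] for r in rows) / len(rows)
    result.append([year, day, sum1, sum2, avg3])

print(result)
# [[2013, 1, 34, 54, 3.0], [2013, 2, 23, 33, 2.0],
#  [2014, 1, 13, 27, 17.0], [2014, 2, 47, 44, 5.5]]
```

Note that sorting puts 2013 before 2014; if you need the original year order, sort only by day within each year or reorder afterwards.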

itertools.groupby doesn't work like SQL's GROUP BY . It groups consecutive elements. This means that if your list is not sorted by the grouping key, you may get multiple groups with the same key. So, say you want to group a list of integers by their parity (even vs odd); you might do this:

L = [1,2,3,4,5,7,8]  # notice that there's no 6 in the list
itertools.groupby(L, lambda i:i%2)

Now, if you come from an SQL world, you might expect this to give you two groups - one for the even numbers and one for the odd numbers. While that makes sense, it is not how Python does things. It considers each element in turn and checks whether it has the same key as the previous element. If so, it is appended to the current group; otherwise, a new group starts.

So with the above list, we get:

key: 1
elements: [1]

key: 0
elements: [2]

key: 1
elements: [3]

key: 0
elements: [4]

key: 1
elements: [5,7]  # see what happened here?

key: 0
elements: [8]
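That fragmented grouping is easy to reproduce as a runnable check (materializing each group into a list, since groupby yields lazy sub-iterators):

```python
import itertools

L = [1, 2, 3, 4, 5, 7, 8]  # notice that there's no 6 in the list

# Each run of equal keys becomes its own group
groups = [(k, list(g)) for k, g in itertools.groupby(L, lambda i: i % 2)]
print(groups)
# [(1, [1]), (0, [2]), (1, [3]), (0, [4]), (1, [5, 7]), (0, [8])]
```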

So if you're looking to make a grouping like in SQL, then you'll want to sort the list beforehand, by the key (criteria) with which you want to group:

L = [1,2,3,4,5,7,8]  # notice that there's no 6 in the list
L.sort(key=lambda i: i % 2)  # now L looks like this: [2, 4, 8, 1, 3, 5, 7] - the evens and the odds stick together
itertools.groupby(L, lambda i: i % 2)  # this gives two groups containing all the elements that belong to each group
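The same check after sorting shows exactly two groups:

```python
import itertools

L = [1, 2, 3, 4, 5, 7, 8]
L.sort(key=lambda i: i % 2)  # stable sort: evens first, then odds

groups = [(k, list(g)) for k, g in itertools.groupby(L, lambda i: i % 2)]
print(groups)  # [(0, [2, 4, 8]), (1, [1, 3, 5, 7])]
```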

I've tried to make a short and concise answer, and while I didn't quite succeed, I did manage to get a lot of Python builtin modules involved:

import itertools
import operator
import functools

I'll use functools.reduce to do the sums but it needs a custom function:

def sum_sum_sum_counter(res, array):
    # Unpack the values of the array
    year, day, val1, val2, val3 = array
    res[0] += val1
    res[1] += val2
    res[2] += val3
    res[3] += 1 # counter
    return res

This function keeps a counter because you want to calculate the average; tracking a sum and a count is more intuitive than a running-mean implementation.

Now the fun part: I'll group by the first two elements (assuming these are sorted; otherwise run something like lst = sorted(lst, key=operator.itemgetter(0,1)) first):

result = []
for i, values in itertools.groupby(lst, operator.itemgetter(0,1)):
    # Now let's use the reduce function with a start list containing zeros
    calc = functools.reduce(sum_sum_sum_counter, values, [0, 0, 0, 0])
    # Append year, day and the results.
    result.append([i[0], i[1], calc[0], calc[1], calc[2]/calc[3]])

The calc[2]/calc[3] is the average of value3. Remember, the last element of the reduce accumulator was a counter - and the sum divided by the count is the average.

Giving me a result:

[[2014, 1, 13, 27, 17.0],
 [2014, 2, 47, 44, 5.5],
 [2013, 1, 34, 54, 3.0],
 [2013, 2, 23, 33, 2.0]]

just using those values you've given.
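Assembled end to end, the pieces above run as-is (the sample rows stand in for lst; they already have each (year, day) key on consecutive rows, so no extra sort is needed here):

```python
import functools
import itertools
import operator

lst = [
    [2014, 1, 10, 20, 30],
    [2014, 1, 3, 7, 4],
    [2014, 2, 14, 43, 5],
    [2014, 2, 33, 1, 6],
    [2013, 1, 34, 54, 3],
    [2013, 2, 23, 33, 2],
]

def sum_sum_sum_counter(res, array):
    # Accumulate sums of the three values plus a row counter
    year, day, val1, val2, val3 = array
    res[0] += val1
    res[1] += val2
    res[2] += val3
    res[3] += 1
    return res

result = []
for i, values in itertools.groupby(lst, operator.itemgetter(0, 1)):
    calc = functools.reduce(sum_sum_sum_counter, values, [0, 0, 0, 0])
    result.append([i[0], i[1], calc[0], calc[1], calc[2] / calc[3]])

print(result)
# [[2014, 1, 13, 27, 17.0], [2014, 2, 47, 44, 5.5],
#  [2013, 1, 34, 54, 3.0], [2013, 2, 23, 33, 2.0]]
```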

On real data, sorting before grouping might become inefficient:

  • firstly, the complete iterator will be consumed, losing one important benefit of functional programming: laziness
  • sorting is O(n log n) compared to grouping O(n)

To group by some predicate the SQL + pythonic way, a simple reduce/accumulate over a collections.defaultdict will do:

from functools import reduce
from collections import defaultdict as DD

def groupby(pred, it):
    return reduce(lambda d, x: d[pred(x)].append(x) or d, it, DD(list))

Then use it with some predicate function or lambda:

>>> words = 'your code might become less readable using reduce'.split()
>>> groupby( len, words )[4]
['your', 'code', 'less']
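Applied to the original question, the same helper buckets the rows by (year, day); the sums and the average are then computed per bucket (a sketch over the 2014 sample rows; dict insertion order is preserved on Python 3.7+):

```python
from functools import reduce
from collections import defaultdict as DD

def groupby(pred, it):
    return reduce(lambda d, x: d[pred(x)].append(x) or d, it, DD(list))

rows = [
    [2014, 1, 10, 20, 30],
    [2014, 1, 3, 7, 4],
    [2014, 2, 14, 43, 5],
    [2014, 2, 33, 1, 6],
]

# Bucket rows by (year, day) - no sorting required
buckets = groupby(lambda r: (r[0], r[1]), rows)

result = [
    [y, d,
     sum(r[2] for r in g),            # sum of value1
     sum(r[3] for r in g),            # sum of value2
     sum(r[4] for r in g) / len(g)]   # average of value3
    for (y, d), g in buckets.items()
]
print(result)  # [[2014, 1, 13, 27, 17.0], [2014, 2, 47, 44, 5.5]]
```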

Concerning laziness: reduce won't return before consuming all its input, and of course neither will sorted. You might use itertools.accumulate instead, always returning the same defaultdict, to consume the input (and process the changing groups) lazily and with a low memory footprint.
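A sketch of that idea (assuming Python 3.8+ for accumulate's initial keyword; every yielded snapshot is the same mutated defaultdict, so each one reflects only the words consumed so far):

```python
from itertools import accumulate, islice
from collections import defaultdict as DD

words = 'your code might become less readable using reduce'.split()

# Each step folds one word into the (shared) defaultdict and yields it
snapshots = accumulate(words, lambda d, w: d[len(w)].append(w) or d,
                       initial=DD(list))
next(snapshots)  # skip the empty initial dict

for d in islice(snapshots, 3):  # only the first three words are consumed
    print(dict(d))
# {4: ['your']}
# {4: ['your', 'code']}
# {4: ['your', 'code'], 5: ['might']}
```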
