
Group by and aggregate the values of a list of dictionaries in Python

I'm trying to write a function that elegantly groups a list of dictionaries by one key and aggregates (sums) the values of like keys.

Example:

import datetime

my_dataset = [
    {
        'date': datetime.date(2013, 1, 1),
        'id': 99,
        'value1': 10,
        'value2': 10
    },
    {
        'date': datetime.date(2013, 1, 1),
        'id': 98,
        'value1': 10,
        'value2': 10
    },
    {
        'date': datetime.date(2013, 1, 2),
        'id': 99,
        'value1': 10,
        'value2': 10
    }
]

group_and_sum_dataset(my_dataset, 'date', ['value1', 'value2'])

"""
Should return:
[
    {
        'date': datetime.date(2013, 1, 1),
        'value1': 20,
        'value2': 20
    },
    {
        'date': datetime.date(2013, 1, 2),
        'value1': 10,
        'value2': 10
    }
]
"""

I've tried doing this using itertools.groupby and summing each like-key value, but I'm missing something. Here's what my function currently looks like:

def group_and_sum_dataset(dataset, group_by_key, sum_value_keys):
    keyfunc = operator.itemgetter(group_by_key)
    dataset.sort(key=keyfunc)
    new_dataset = []
    for key, index in itertools.groupby(dataset, keyfunc):
        d = {group_by_key: key}
        d.update({k:sum([item[k] for item in index]) for k in sum_value_keys})
        new_dataset.append(d)
    return new_dataset
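A likely culprit in the function above: each group yielded by itertools.groupby is a one-shot iterator, so the sum for the first key in sum_value_keys consumes it and every later key sums an empty sequence (giving 0). A minimal sketch of a fix, assuming the rest of the function stays the same, is to materialize each group into a list first. The variable name rows is my own, and sorted() is used here instead of dataset.sort() only so the caller's list is not reordered.

```python
import datetime
import itertools
import operator

def group_and_sum_dataset(dataset, group_by_key, sum_value_keys):
    keyfunc = operator.itemgetter(group_by_key)
    new_dataset = []
    # groupby only merges adjacent items, so the input must be sorted by the key
    for key, group in itertools.groupby(sorted(dataset, key=keyfunc), keyfunc):
        rows = list(group)  # the group iterator can be read only once
        d = {group_by_key: key}
        d.update({k: sum(row[k] for row in rows) for k in sum_value_keys})
        new_dataset.append(d)
    return new_dataset

my_dataset = [
    {'date': datetime.date(2013, 1, 1), 'id': 99, 'value1': 10, 'value2': 10},
    {'date': datetime.date(2013, 1, 1), 'id': 98, 'value1': 10, 'value2': 10},
    {'date': datetime.date(2013, 1, 2), 'id': 99, 'value1': 10, 'value2': 10},
]

result = group_and_sum_dataset(my_dataset, 'date', ['value1', 'value2'])
print(result)
```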

You can use collections.Counter and collections.defaultdict.

With a dict this can be done in O(N) time, whereas sorting first (as itertools.groupby requires) takes O(N log N).

from collections import defaultdict, Counter
def solve(dataset, group_by_key, sum_value_keys):
    dic = defaultdict(Counter)
    for item in dataset:
        key = item[group_by_key]
        vals = {k:item[k] for k in sum_value_keys}
        dic[key].update(vals)
    return dic
>>> d = solve(my_dataset, 'date', ['value1', 'value2'])
>>> d
defaultdict(<class 'collections.Counter'>,
{
 datetime.date(2013, 1, 2): Counter({'value2': 10, 'value1': 10}),
 datetime.date(2013, 1, 1): Counter({'value2': 20, 'value1': 20})
})

The advantage of Counter is that its update() method adds the values of matching keys instead of replacing them:

Example:

>>> c = Counter(**{'value1': 10, 'value2': 5})
>>> c.update({'value1': 7, 'value2': 3})
>>> c
Counter({'value1': 17, 'value2': 8})
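For contrast, here is a small side-by-side sketch of the same update() call on a plain dict and on a Counter. The variable names plain and counted are only illustrative.

```python
from collections import Counter

plain = {'value1': 10, 'value2': 5}
plain.update({'value1': 7, 'value2': 3})    # dict.update overwrites existing values

counted = Counter({'value1': 10, 'value2': 5})
counted.update({'value1': 7, 'value2': 3})  # Counter.update adds to existing values

print(plain)    # {'value1': 7, 'value2': 3}
print(counted)  # Counter({'value1': 17, 'value2': 8})
```

This difference is what lets the grouping loop call update() once per row and end up with running totals.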

Thanks, I forgot about Counter. I still wanted to keep the output format and the sorting of my returned dataset, so here's what my final function looks like:

from collections import Counter, defaultdict

def group_and_sum_dataset(dataset, group_by_key, sum_value_keys):

    container = defaultdict(Counter)

    for item in dataset:
        key = item[group_by_key]
        values = {k:item[k] for k in sum_value_keys}
        container[key].update(values)

    # list(...) is needed on Python 3, where dict.items() returns a view, not a list
    new_dataset = [
        dict([(group_by_key, item[0])] + list(item[1].items()))
            for item in container.items()
    ]
    new_dataset.sort(key=lambda item: item[group_by_key])

    return new_dataset

Here's an approach using more_itertools, where you simply focus on how to construct the output.

Given

import datetime
import collections as ct

import more_itertools as mit


dataset = [
    {"date": datetime.date(2013, 1, 1), "id": 99, "value1": 10, "value2": 10},
    {"date": datetime.date(2013, 1, 1), "id": 98, "value1": 10, "value2": 10},
    {"date": datetime.date(2013, 1, 2), "id": 99, "value1": 10, "value2": 10}
]

Code

# Step 1: Build helper functions    
kfunc = lambda d: d["date"]
vfunc = lambda d: {k:v for k, v in d.items() if k.startswith("val")}
rfunc = lambda lst: sum((ct.Counter(d) for d in lst), ct.Counter())

# Step 2: Build a dict    
reduced = mit.map_reduce(dataset, keyfunc=kfunc, valuefunc=vfunc, reducefunc=rfunc)
reduced

Output

defaultdict(None,
            {datetime.date(2013, 1, 1): Counter({'value1': 20, 'value2': 20}),
             datetime.date(2013, 1, 2): Counter({'value1': 10, 'value2': 10})})

The items are grouped by date, and the pertinent values are reduced to Counters.


Details

Steps

  1. Build helper functions to customize the construction of keys, values, and reduced values in the final defaultdict. Here we want to:
    • group by date (kfunc)
    • build dicts keeping only the "value*" parameters (vfunc)
    • aggregate the dicts (rfunc) by converting them to collections.Counter objects and summing them. See an equivalent rfunc below (+).
  2. Pass the helper functions to more_itertools.map_reduce.
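To make the three roles in the steps above concrete, here is a rough standard-library sketch of the same idea. The names map_reduce_sketch and sum_dicts are hypothetical stand-ins written for this illustration; this is not the library's implementation, only an approximation of what the key, value, and reduce functions each contribute.

```python
import collections
import datetime

def map_reduce_sketch(iterable, keyfunc, valuefunc, reducefunc):
    """Hypothetical stand-in: group by keyfunc, transform with valuefunc, then reduce."""
    grouped = collections.defaultdict(list)
    for item in iterable:
        grouped[keyfunc(item)].append(valuefunc(item))  # group and transform
    return {key: reducefunc(values) for key, values in grouped.items()}  # reduce

def sum_dicts(dicts):
    """Sum a list of dicts key by key using Counter.update."""
    total = collections.Counter()
    for d in dicts:
        total.update(d)
    return total

dataset = [
    {"date": datetime.date(2013, 1, 1), "id": 99, "value1": 10, "value2": 10},
    {"date": datetime.date(2013, 1, 1), "id": 98, "value1": 10, "value2": 10},
    {"date": datetime.date(2013, 1, 2), "id": 99, "value1": 10, "value2": 10},
]

result = map_reduce_sketch(
    dataset,
    keyfunc=lambda d: d["date"],
    valuefunc=lambda d: {k: v for k, v in d.items() if k.startswith("val")},
    reducefunc=sum_dicts,
)
print(result)
```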

Simple Groupby

... say in that example you wanted to group by id and date?

No problem.

>>> kfunc2 = lambda d: (d["date"], d["id"])
>>> mit.map_reduce(dataset, keyfunc=kfunc2, valuefunc=vfunc, reducefunc=rfunc)
defaultdict(None,
            {(datetime.date(2013, 1, 1),
              99): Counter({'value1': 10, 'value2': 10}),
             (datetime.date(2013, 1, 1),
              98): Counter({'value1': 10, 'value2': 10}),
             (datetime.date(2013, 1, 2),
              99): Counter({'value1': 10, 'value2': 10})})

Customized Output

While the resulting data structure presents the outcome clearly and concisely, the OP's expected output can be rebuilt as a simple list of dicts:

>>> [{**dict(date=k), **v} for k, v in reduced.items()]
[{'date': datetime.date(2013, 1, 1), 'value1': 20, 'value2': 20},
 {'date': datetime.date(2013, 1, 2), 'value1': 10, 'value2': 10}]

For more on map_reduce, see the docs. Install with: pip install more_itertools

+ An equivalent reducing function:

import typing

def rfunc(lst: typing.List[dict]) -> ct.Counter:
    """Return reduced mappings from map-reduce values."""
    c = ct.Counter()
    for d in lst:
        c += ct.Counter(d)
    return c
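One caveat worth knowing if the values can be zero or negative: Counter addition (both + and +=) keeps only positive counts, whereas Counter.update keeps every key. The sketch below compares the two on made-up rows; the names rfunc_add and rfunc_update and the sample data are hypothetical and exist only for this comparison.

```python
import collections as ct

# Made-up rows whose true column sums are value1 = 0 and value2 = -3
rows = [{"value1": 10, "value2": -4}, {"value1": -10, "value2": 1}]

def rfunc_add(lst):
    """Reduce with +=, which drops non-positive totals after every step."""
    c = ct.Counter()
    for d in lst:
        c += ct.Counter(d)
    return c

def rfunc_update(lst):
    """Reduce with update(), which keeps zero and negative totals."""
    c = ct.Counter()
    for d in lst:
        c.update(d)
    return c

print(dict(rfunc_add(rows)))     # {'value2': 1}
print(dict(rfunc_update(rows)))  # {'value1': 0, 'value2': -3}
```

For strictly positive values, as in the question's dataset, both variants give the same totals.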
