Python - Group/merge dictionaries based on identical key/values

Question

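I have a list containing many dictionaries that share the same keys but have different values.

What I would like to do is group/merge the dictionaries based on the values of certain keys. It is probably quicker to show an example than to try to explain: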

[{'zone': 'A', 'weekday': 1, 'hour': 12,  'C1': 3, 'C2': 15},
 {'zone': 'B', 'weekday': 2, 'hour': 6,  'C1': 5, 'C2': 27},
 {'zone': 'A', 'weekday': 1, 'hour': 12,  'C1': 7, 'C2': 12},
 {'zone': 'C', 'weekday': 5, 'hour': 8,  'C1': 2, 'C2': 13}]

So, what I'd like to achieve is to merge the first and third dictionaries, since they have the same 'zone', 'hour' and 'weekday', summing the values in C1 and C2:

[{'zone': 'A', 'weekday': 1, 'hour': 12,  'C1': 10, 'C2': 27},
 {'zone': 'B', 'weekday': 2, 'hour': 6,  'C1': 5, 'C2': 27},
 {'zone': 'C', 'weekday': 5, 'hour': 8,  'C1': 2, 'C2': 13}]

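Any help here? :) I've been struggling with this for a few days. I have an ugly, non-scalable solution, but I'm sure there is something more Pythonic I could implement.

Thanks!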

Answer 1

By using defaultdict, you can merge them in linear time.

from collections import defaultdict

res = defaultdict(lambda: defaultdict(int))

# dictionaries is your list of dicts
for d in dictionaries:
    key = (d['zone'], d['weekday'], d['hour'])
    res[key]['C1'] += d['C1']
    res[key]['C2'] += d['C2']

The downside is that you need another pass to get the output in the form you specified.
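That extra pass could look something like the sketch below. It repeats the accumulation step so that it runs on its own, uses the sample data from the question, and the name `merged` is only an illustrative choice.

```python
from collections import defaultdict

dictionaries = [
    {'zone': 'A', 'weekday': 1, 'hour': 12, 'C1': 3, 'C2': 15},
    {'zone': 'B', 'weekday': 2, 'hour': 6, 'C1': 5, 'C2': 27},
    {'zone': 'A', 'weekday': 1, 'hour': 12, 'C1': 7, 'C2': 12},
    {'zone': 'C', 'weekday': 5, 'hour': 8, 'C1': 2, 'C2': 13},
]

# First pass: accumulate the sums per (zone, weekday, hour) key.
res = defaultdict(lambda: defaultdict(int))
for d in dictionaries:
    key = (d['zone'], d['weekday'], d['hour'])
    res[key]['C1'] += d['C1']
    res[key]['C2'] += d['C2']

# Second pass: flatten each key tuple and its sums back into one dict.
merged = [
    {'zone': zone, 'weekday': weekday, 'hour': hour, **sums}
    for (zone, weekday, hour), sums in res.items()
]

print(merged)
```

Both passes are linear, so the overall cost stays linear in the number of input dictionaries.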

Answer 2

I've written a slightly longer solution that uses a namedtuple as the dictionary key:

from collections import namedtuple

zones = [{'zone': 'A', 'weekday': 1, 'hour': 12,  'C1': 3, 'C2': 15},
 {'zone': 'B', 'weekday': 2, 'hour': 6,  'C1': 5, 'C2': 27},
 {'zone': 'A', 'weekday': 1, 'hour': 12,  'C1': 7, 'C2': 12},
 {'zone': 'C', 'weekday': 5, 'hour': 8,  'C1': 2, 'C2': 13}]

ZoneTime = namedtuple("ZoneTime", ["zone", "weekday", "hour"])
results = dict()

for zone in zones:
    zone_time = ZoneTime(zone['zone'], zone['weekday'], zone['hour'])
    if zone_time in results:
        results[zone_time]['C1'] += zone['C1']
        results[zone_time]['C2'] += zone['C2']
    else:
        results[zone_time] = {'C1': zone['C1'], 'C2': zone['C2']}


print(results)

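This uses a namedtuple of (zone, weekday, hour) as the key for each dictionary. From there it is fairly straightforward to either add to the existing entry if the key is already in results, or create a new entry in the dictionary.

You could definitely make this shorter and "cleverer", but it might become less readable.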
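If the list-of-dicts shape from the question is needed, one possible follow-up is to expand each namedtuple key with its `_asdict()` method. This is only a sketch: it repeats the loop so that it is self-contained, and `flat` is a name chosen here for illustration.

```python
from collections import namedtuple

zones = [
    {'zone': 'A', 'weekday': 1, 'hour': 12, 'C1': 3, 'C2': 15},
    {'zone': 'B', 'weekday': 2, 'hour': 6, 'C1': 5, 'C2': 27},
    {'zone': 'A', 'weekday': 1, 'hour': 12, 'C1': 7, 'C2': 12},
    {'zone': 'C', 'weekday': 5, 'hour': 8, 'C1': 2, 'C2': 13},
]

ZoneTime = namedtuple("ZoneTime", ["zone", "weekday", "hour"])
results = {}

for zone in zones:
    zone_time = ZoneTime(zone['zone'], zone['weekday'], zone['hour'])
    if zone_time in results:
        results[zone_time]['C1'] += zone['C1']
        results[zone_time]['C2'] += zone['C2']
    else:
        results[zone_time] = {'C1': zone['C1'], 'C2': zone['C2']}

# _asdict() turns the key back into {'zone': ..., 'weekday': ..., 'hour': ...},
# which is then combined with the summed C1/C2 values.
flat = [{**zone_time._asdict(), **sums} for zone_time, sums in results.items()]

print(flat)
```

Using the field names from the namedtuple avoids indexing the key by position when rebuilding the dictionaries.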

Answer 3

Sort by the relevant keys, then group; iterate over the groups and create new dictionaries with the summed values.

import operator
import itertools

keys = operator.itemgetter('zone','weekday','hour')
c1_c2 = operator.itemgetter('C1','C2')

# data is your list of dicts
data.sort(key=keys)
grouped = itertools.groupby(data,keys)

new_data = []
for (zone,weekday,hour),g in grouped:
    c1,c2 = 0,0
    for d in g:
        c1 += d['C1']
        c2 += d['C2']
    new_data.append({'zone':zone,'weekday':weekday,
                     'hour':hour,'C1':c1,'C2':c2})

The last loop can also be written as:

for (zone,weekday,hour),g in grouped:
    cees = map(c1_c2,g)
    c1,c2 = map(sum,zip(*cees))
    new_data.append({'zone':zone,'weekday':weekday,
                     'hour':hour,'C1':c1,'C2':c2})
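The sort step matters because `itertools.groupby` only groups consecutive items with equal keys. The small sketch below, using the sample data from the question, compares the group keys with and without sorting; it uses `sorted()` rather than `data.sort()` purely so the original list is left untouched for the comparison.

```python
from itertools import groupby
from operator import itemgetter

data = [
    {'zone': 'A', 'weekday': 1, 'hour': 12, 'C1': 3, 'C2': 15},
    {'zone': 'B', 'weekday': 2, 'hour': 6, 'C1': 5, 'C2': 27},
    {'zone': 'A', 'weekday': 1, 'hour': 12, 'C1': 7, 'C2': 12},
    {'zone': 'C', 'weekday': 5, 'hour': 8, 'C1': 2, 'C2': 13},
]

keys = itemgetter('zone', 'weekday', 'hour')

# Without sorting, the two ('A', 1, 12) rows are not adjacent,
# so groupby reports them as two separate groups.
unsorted_keys = [key for key, _ in groupby(data, keys)]

# After sorting by the same key function, equal keys are adjacent
# and each key appears exactly once.
sorted_keys = [key for key, _ in groupby(sorted(data, key=keys), keys)]

print(len(unsorted_keys), unsorted_keys)
print(len(sorted_keys), sorted_keys)
```

The sort costs O(n log n), which is the price this approach pays compared with the dictionary-based answers.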

Answer 4

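Edit: runtime comparison

My original answer (see below) was not a good one, but I think I made a useful contribution by doing some runtime analysis of the other answers, so I have edited that part in and put it at the top. Here I include the three other solutions, together with the transformation needed to produce the desired output. For completeness, I also include a version using pandas, which assumes the user is already working with a DataFrame (converting from a list of dicts to a dataframe and back is not even close to being worth it). Each figure is the total time in seconds reported by timeit for 1000 calls. The comparison times vary slightly depending on the random data generated, but these are fairly representative: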

>>> run_timer(100)
Times with 100 values
    ...with defaultdict: 0.1496697600000516
    ...with namedtuple: 0.14976404899994122
    ...with groupby: 0.0690777249999428
    ...with pandas: 3.3165711250001095
>>> run_timer(1000)
Times with 1000 values
    ...with defaultdict: 1.267153091999944
    ...with namedtuple: 0.9605341750000207
    ...with groupby: 0.6634409229998255
    ...with pandas: 3.5146895360001054
>>> run_timer(10000)
Times with 10000 values
    ...with defaultdict: 9.194478484000001
    ...with namedtuple: 9.157486462000179
    ...with groupby: 5.18553969300001
    ...with pandas: 4.704001281000046
>>> run_timer(100000)
Times with 100000 values
    ...with defaultdict: 59.644778522000024
    ...with namedtuple: 89.26688319799996
    ...with groupby: 93.3517027989999
    ...with pandas: 14.495209061999958

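Takeaways:

Using a Pandas dataframe pays off big time for large datasets
- Note: I do not include the conversion between the list of dicts and the dataframe, which is definitely significant
Otherwise, the accepted solution (from wwii) performs best on small to medium sized datasets, but may be the slowest for very large datasets
Changing the size of the groups (e.g., by reducing the number of zones) has a huge effect, which is not examined here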
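One way the group-size effect could be explored is to make the number of zones a parameter of the data generator. The sketch below is not part of the timing script: `gen_random_vals_with_zones` and `count_groups` are hypothetical helpers that only mirror the shape of the random data (zone letter, weekday 1 to 7, hour 0 to 23) and report how many distinct groups result.

```python
import random
import string


def gen_random_vals_with_zones(num, num_zones=26, seed=None):
    """Generate `num` random rows drawn from the first `num_zones` letters."""
    rng = random.Random(seed)
    zones = string.ascii_uppercase[:num_zones]
    return [
        {
            'zone': rng.choice(zones),
            'weekday': rng.randint(1, 7),
            'hour': rng.randint(0, 23),
            'C1': rng.randint(1, 50),
            'C2': rng.randint(1, 50),
        }
        for _ in range(num)
    ]


def count_groups(rows):
    """Count the distinct (zone, weekday, hour) combinations."""
    return len({(r['zone'], r['weekday'], r['hour']) for r in rows})


for num_zones in (1, 5, 26):
    rows = gen_random_vals_with_zones(10000, num_zones, seed=0)
    # At most num_zones * 7 * 24 distinct groups are possible.
    print(num_zones, count_groups(rows), num_zones * 7 * 24)
```

Fewer zones means fewer, larger groups for the same number of rows, which is the dimension the timings above leave unexplored.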

Here is the script I used to generate the output above.

import random
import pandas

from timeit import timeit

from functools import partial

from itertools import groupby
from operator import itemgetter

from collections import namedtuple, defaultdict


def with_pandas(df):
    return df.groupby(['zone', 'weekday', 'hour']).agg(sum).reset_index()


def with_groupby(data):
    keys = itemgetter('zone', 'weekday', 'hour')

    # data is your list of dicts
    data.sort(key=keys)
    grouped = groupby(data, keys)

    new_data = []
    for (zone, weekday, hour), g in grouped:
        c1, c2 = 0, 0
        for d in g:
            c1 += d['C1']
            c2 += d['C2']
        new_data.append({'zone': zone, 'weekday': weekday,
                         'hour': hour, 'C1': c1, 'C2': c2})

    return new_data


def with_namedtuple(zones):
    ZoneTime = namedtuple("ZoneTime", ["zone", "weekday", "hour"])
    results = dict()
    for zone in zones:
        zone_time = ZoneTime(zone['zone'], zone['weekday'], zone['hour'])
        if zone_time in results:
            results[zone_time]['C1'] += zone['C1']
            results[zone_time]['C2'] += zone['C2']
        else:
            results[zone_time] = {'C1': zone['C1'], 'C2': zone['C2']}
    return [
        {
            'zone': key[0],
            'weekday': key[1],
            'hour': key[2],
            **val
        }
        for key, val in results.items()
    ]


def with_defaultdict(dictionaries):
    res = defaultdict(lambda: defaultdict(int))
    for d in dictionaries:
        res[(d['zone'], d['weekday'], d['hour'])]['C1'] += d['C1']
        res[(d['zone'], d['weekday'], d['hour'])]['C2'] += d['C2']
    return [
        {
            'zone': key[0],
            'weekday': key[1],
            'hour': key[2],
            **val
        }
        for key, val in res.items()
    ]


def gen_random_vals(num):
    return [
        {
            'zone': random.choice('ABCDEFGHIJKLMNOPQRSTUVWXYZ'),
            'weekday': random.randint(1, 7),
            'hour': random.randint(0, 23),
            'C1': random.randint(1, 50),
            'C2': random.randint(1, 50),
        }
        for idx in range(num)
    ]


def run_timer(num_vals=1000, timeit_num=1000):
    vals = gen_random_vals(num_vals)
    df = pandas.DataFrame(vals)
    p_fmt = "\t...with %s: %s"
    times = {
        'defaultdict': timeit(stmt=partial(with_defaultdict, vals), number=timeit_num),
        'namedtuple': timeit(stmt=partial(with_namedtuple, vals), number=timeit_num),
        'groupby': timeit(stmt=partial(with_groupby, vals), number=timeit_num),
        'pandas': timeit(stmt=partial(with_pandas, df), number=timeit_num),
    }
    print("Times with %d values" % num_vals)
    for key, val in times.items():
        print(p_fmt % (key, val))

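where

with_groupby uses wwii's solution
with_namedtuple uses Jose Salvatierra's solution
with_defaultdict uses abc's solution
with_pandas uses the solution proposed by Alexander Cécile in the comments
- it assumes the data is already in a DataFrame and produces a DataFrame as the result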

Original answer:

Just for fun, here is a completely different approach using groupby. Granted, it is not the prettiest, but it should be fairly quick.

from itertools import groupby
from operator import itemgetter
from pprint import pprint

vals = [
    {'zone': 'A', 'weekday': 1, 'hour': 12,  'C1': 3, 'C2': 15},
    {'zone': 'B', 'weekday': 2, 'hour': 6,  'C1': 5, 'C2': 27},
    {'zone': 'A', 'weekday': 1, 'hour': 12,  'C1': 7, 'C2': 12},
    {'zone': 'C', 'weekday': 5, 'hour': 8,  'C1': 2, 'C2': 13}
]
ordered = sorted(
    [
        (
            (row['zone'], row['weekday'], row['hour']),
            row['C1'], row['C2']
        )
        for row in vals
    ]
)


def invert_columns(grp):
    return zip(*[g_row[1:] for g_row in grp])


merged = [
    {
        'zone': key[0],
        'weekday': key[1],
        'hour': key[2],
        **dict(
            zip(["C1", "C2"], [sum(col) for col in invert_columns(grp)])
        )
    }
    for key, grp in groupby(ordered, itemgetter(0))
]

pprint(merged)

This produces

[{'C1': 10, 'C2': 27, 'hour': 12, 'weekday': 1, 'zone': 'A'},
 {'C1': 5, 'C2': 27, 'hour': 6, 'weekday': 2, 'zone': 'B'},
 {'C1': 2, 'C2': 13, 'hour': 8, 'weekday': 5, 'zone': 'C'}]
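The least obvious step in that approach is probably `invert_columns`, which transposes the rows of a group into columns with `zip(*...)` so that each column can be summed. A small illustration, using the two rows of the ('A', 1, 12) group as they would appear in `ordered`:

```python
# Each row is (key_tuple, C1, C2), matching the structure of `ordered`.
group = [
    (('A', 1, 12), 3, 15),
    (('A', 1, 12), 7, 12),
]

# Drop the key from each row, then transpose rows into columns:
# [(3, 15), (7, 12)] becomes [(3, 7), (15, 12)].
columns = list(zip(*[row[1:] for row in group]))

# Summing each column gives the merged C1 and C2 values.
sums = [sum(col) for col in columns]

print(columns)
print(dict(zip(["C1", "C2"], sums)))
```

The same transposition idea underlies the `map(sum, zip(*cees))` variant in Answer 3.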

4 solutions

Solution 1
3 2019-12-03 17:07:31

Solution 2
2 2019-12-03 17:09:45

Solution 3
2 Accepted 2019-12-03 19:47:51

Solution 4
1 2019-12-03 17:40:53
