
Python. Merge dict rows with SUM

I have a lot of dict rows, more than 10 million, like this:

{'value_01': '123', 'value_02': '456', 'datacenter': '1', 'bytes': '25'}
{'value_01': '123', 'value_02': '456', 'datacenter': '1', 'bytes': '35'}
{'value_01': '678', 'value_02': '901', 'datacenter': '2', 'bytes': '55'}
{'value_01': '678', 'value_02': '456', 'datacenter': '2', 'bytes': '15'}

Is it possible to merge the rows in which all other keys and values are the same into a single row, with 'bytes' replaced by its SUM? I would like to minimize the number of rows and end up with something like this. It should speed up the next steps of processing.

{'value_01': '123', 'value_02': '456', 'datacenter': '1', 'bytes': '60'}
{'value_01': '678', 'value_02': '901', 'datacenter': '2', 'bytes': '55'}
{'value_01': '678', 'value_02': '456', 'datacenter': '2', 'bytes': '15'}

Thanks in advance.

Using an intermediate dictionary indexed on all "other" keys, you can accumulate the 'bytes' values in a common dictionary for each combination of the other fields, then convert the indexed values back into a list of dictionaries:

lst = [{'value_01': '123', 'value_02': '456', 'datacenter': '1', 'bytes': '25'},
       {'value_01': '123', 'value_02': '456', 'datacenter': '1', 'bytes': '35'},
       {'value_01': '678', 'value_02': '901', 'datacenter': '2', 'bytes': '55'},
       {'value_01': '678', 'value_02': '456', 'datacenter': '2', 'bytes': '15'}]

merged = dict()
for d in lst:
    k = map(d.get,sorted({*d}-{"bytes"}))  # index on all other fields
    m = merged.setdefault(tuple(k),d)      # add/get first instance
    if m is not d:                         # accumulate bytes (as strings) 
        m['bytes'] = str(int(m['bytes']) + int(d['bytes']))
mergedList = list(merged.values())

print(mergedList)
[{'value_01': '123', 'value_02': '456', 'datacenter': '1', 'bytes': '60'},
 {'value_01': '678', 'value_02': '901', 'datacenter': '2', 'bytes': '55'},
 {'value_01': '678', 'value_02': '456', 'datacenter': '2', 'bytes': '15'}]

This works without sorting the rows (i.e. in O(n) time), even if your data is not grouped by the combination of the other fields. It also works if the order of the keys differs between rows. Missing keys would be problematic, because rows with different sets of fields could then produce the same tuple of values, but they can be taken into account by using a comprehension instead of map(d.get, ...).
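One way to make the index robust to missing keys is to build it from (name, value) pairs rather than from values alone, so that rows with different sets of fields can never share a key. The following is only a sketch under that assumption: the function name merge_rows and its sum_field parameter are made up for illustration, and unlike the loop above it copies each first instance instead of updating the input rows in place.

```python
def merge_rows(rows, sum_field="bytes"):
    """Merge rows that agree on every field except sum_field (hypothetical helper)."""
    merged = {}
    for row in rows:
        # Index on (name, value) pairs: key order does not matter,
        # and a missing field yields a different key.
        key = frozenset((k, v) for k, v in row.items() if k != sum_field)
        if key in merged:
            target = merged[key]
            # Byte counts are kept as strings, as in the original data.
            target[sum_field] = str(int(target[sum_field]) + int(row[sum_field]))
        else:
            merged[key] = dict(row)  # copy, so the input rows stay untouched
    return list(merged.values())


rows = [
    {'value_01': '123', 'value_02': '456', 'datacenter': '1', 'bytes': '25'},
    {'value_02': '456', 'value_01': '123', 'datacenter': '1', 'bytes': '35'},  # different key order
    {'value_01': '123', 'datacenter': '1', 'bytes': '5'},                      # 'value_02' missing
]
result = merge_rows(rows)
print(result)
```

The first two rows are merged and the third stays separate. This assumes that every field value is hashable (true for the strings used here) and that every row contains the summed field.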

Note that you really should store the byte counts as integers instead of strings.

The code below should work:

from collections import defaultdict

lst = [{'value_01': '123', 'value_02': '456', 'datacenter': '1', 'bytes': '25'},
       {'value_01': '123', 'value_02': '456', 'datacenter': '1', 'bytes': '35'},
       {'value_01': '678', 'value_02': '901', 'datacenter': '2', 'bytes': '55'},
       {'value_01': '678', 'value_02': '456', 'datacenter': '2', 'bytes': '15'}]
keys = ['value_01', 'value_02', 'datacenter']
data = defaultdict(int)
for entry in lst:
    key = tuple(entry[k] for k in keys)  # group on the non-summed fields
    data[key] += int(entry['bytes'])
print(data)

Output:

defaultdict(<class 'int'>, {('123', '456', '1'): 60, ('678', '901', '2'): 55, ('678', '456', '2'): 15})
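The result above is keyed by tuples rather than shaped like the original rows. If the next processing step expects dictionaries again, something along these lines could rebuild them. It is a sketch that assumes the same keys list as above and keeps the totals as integers; the helper names sum_bytes and to_rows are illustrative only.

```python
from collections import defaultdict

keys = ['value_01', 'value_02', 'datacenter']


def sum_bytes(rows, keys):
    """Total the 'bytes' field per combination of the given keys."""
    totals = defaultdict(int)
    for row in rows:  # rows may be any iterable, including a generator
        totals[tuple(row[k] for k in keys)] += int(row['bytes'])
    return totals


def to_rows(totals, keys):
    """Turn {(v1, v2, ...): total} back into a list of row dictionaries."""
    return [dict(zip(keys, key), bytes=total) for key, total in totals.items()]


lst = [{'value_01': '123', 'value_02': '456', 'datacenter': '1', 'bytes': '25'},
       {'value_01': '123', 'value_02': '456', 'datacenter': '1', 'bytes': '35'},
       {'value_01': '678', 'value_02': '901', 'datacenter': '2', 'bytes': '55'},
       {'value_01': '678', 'value_02': '456', 'datacenter': '2', 'bytes': '15'}]

merged_rows = to_rows(sum_bytes(lst, keys), keys)
print(merged_rows)
```

Because sum_bytes only iterates over its input once, the 10 million rows could be fed from a generator instead of being held in a list first; only one entry per distinct key combination stays in memory.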
