Summing duplicates in a list of dictionaries by a compound key using itertools
I have a sorted list of dictionaries like so:
dat = [
    {"id1": 1, "id2": 2, "value": 1},
    {"id1": 1, "id2": 2, "value": 2},
    {"id1": 2, "id2": 2, "value": 2},
    {"id1": 2, "id2": 3, "value": 1},
    {"id1": 3, "id2": 3, "value": 1},
    {"id1": 3, "id2": 4, "value": 1},
    {"id1": 3, "id2": 4, "value": 1},
    {"id1": 3, "id2": 4, "value": 1},
    {"id1": 3, "id2": 4, "value": 1},
]
This is effectively a list of (id1, id2, value) tuples, but with duplicates. I would like to deduplicate these by summing the values where both ids are equal, leaving me with unique (id1, id2) pairs where the new value is the sum of the dupes.
That is, from above, the desired output is:
dat = [
    {'id1': 1, 'id2': 2, 'value': 3},
    {'id1': 2, 'id2': 2, 'value': 2},
    {'id1': 2, 'id2': 3, 'value': 1},
    {'id1': 3, 'id2': 3, 'value': 1},
    {'id1': 3, 'id2': 4, 'value': 4}
]
Assume the list is millions of items long with lots of duplicates. What's the most efficient way to do this using itertools or funcy (versus, say, using pandas)?
You can start with collections.Counter and use the += operator; the convenient part of Counter is that += assumes zero for nonexistent keys.
dat = [
    {"id1": 1, "id2": 2, "value": 1},
    {"id1": 1, "id2": 2, "value": 2},
    {"id1": 2, "id2": 2, "value": 2},
    {"id1": 2, "id2": 3, "value": 1},
    {"id1": 3, "id2": 3, "value": 1},
    {"id1": 3, "id2": 4, "value": 1},
    {"id1": 3, "id2": 4, "value": 1},
    {"id1": 3, "id2": 4, "value": 1},
    {"id1": 3, "id2": 4, "value": 1},
]

from collections import Counter

cnt = Counter()
for item in dat:
    cnt[item["id1"], item["id2"]] += item["value"]

[{'id1': id1, 'id2': id2, 'value': v} for (id1, id2), v in cnt.items()]
Giving:
[{'id1': 1, 'id2': 2, 'value': 3},
{'id1': 2, 'id2': 2, 'value': 2},
{'id1': 2, 'id2': 3, 'value': 1},
{'id1': 3, 'id2': 3, 'value': 1},
{'id1': 3, 'id2': 4, 'value': 4}]
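As a quick standalone illustration of the zero-default behavior (not part of the original answer): += works on a key that was never set, and even a plain lookup of an absent key returns 0 without inserting it.

```python
from collections import Counter

cnt = Counter()
cnt["missing"] += 5        # no KeyError: absent keys are treated as 0
print(cnt["missing"])      # 5
print(cnt["never_set"])    # 0 -- lookup of an absent key returns 0
print("never_set" in cnt)  # False -- the lookup did not insert the key
```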
We could use collections.defaultdict as well:
from collections import defaultdict

tmp = defaultdict(int)
for d in dat:
    tmp[d['id1'], d['id2']] += d['value']

out = [{'id1': id1, 'id2': id2, 'value': v} for (id1, id2), v in tmp.items()]
or (assuming the ids are sorted), itertools.groupby:
from itertools import groupby

out = [{'id1': k1, 'id2': k2, 'value': sum(d['value'] for d in g)}
       for (k1, k2), g in groupby(dat, lambda x: (x['id1'], x['id2']))]
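The "assuming the ids are sorted" caveat matters: groupby only coalesces runs of consecutive equal keys, so unsorted input silently yields duplicate groups rather than raising an error. A small sketch of the pitfall, using hypothetical data (not the data from the question):

```python
from itertools import groupby

unsorted = [{"id": 1, "value": 1}, {"id": 2, "value": 2}, {"id": 1, "value": 3}]

# groupby only merges *consecutive* equal keys, so id 1 shows up twice here
groups = [(k, sum(d["value"] for d in g))
          for k, g in groupby(unsorted, key=lambda d: d["id"])]
print(groups)  # [(1, 1), (2, 2), (1, 3)] -- id 1 was not merged

# sorting first restores one group per key
ordered = sorted(unsorted, key=lambda d: d["id"])
groups = [(k, sum(d["value"] for d in g))
          for k, g in groupby(ordered, key=lambda d: d["id"])]
print(groups)  # [(1, 4), (2, 2)]
```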
or groupby + sum + to_dict in pandas:
import pandas as pd

out = pd.DataFrame(dat).groupby(['id1', 'id2'], as_index=False)['value'].sum().to_dict('records')
Output:
[{'id1': 1, 'id2': 2, 'value': 3},
{'id1': 2, 'id2': 2, 'value': 2},
{'id1': 2, 'id2': 3, 'value': 1},
{'id1': 3, 'id2': 3, 'value': 1},
{'id1': 3, 'id2': 4, 'value': 4}]
A basic benchmark on the provided data says groupby using itemgetter (as suggested by @ShadowRanger) is the fastest:
6.57 µs ± 491 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
9.56 µs ± 1.47 µs per loop (mean ± std. dev. of 7 runs, 100000 loops each)
6.01 µs ± 182 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
9.02 µs ± 598 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
3.81 ms ± 68.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
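The timings above come from IPython's %timeit; a rough plain-timeit harness in the same spirit looks like the sketch below. The two functions are stand-ins for the Counter and groupby snippets being compared, not the exact harness used for the numbers above.

```python
import timeit
from collections import Counter
from itertools import groupby
from operator import itemgetter

dat = [
    {"id1": 1, "id2": 2, "value": 1},
    {"id1": 1, "id2": 2, "value": 2},
    {"id1": 2, "id2": 2, "value": 2},
]

def with_counter(data):
    # dict-based accumulation: works on unsorted data
    cnt = Counter()
    for item in data:
        cnt[item["id1"], item["id2"]] += item["value"]
    return [{"id1": a, "id2": b, "value": v} for (a, b), v in cnt.items()]

def with_groupby(data):
    # run-based accumulation: requires data grouped by (id1, id2)
    get_ids = itemgetter("id1", "id2")
    return [{"id1": a, "id2": b, "value": sum(d["value"] for d in g)}
            for (a, b), g in groupby(data, key=get_ids)]

for fn in (with_counter, with_groupby):
    t = timeit.timeit(lambda: fn(dat), number=10_000)
    print(f"{fn.__name__}: {t:.4f}s for 10k runs")
```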
Now if we duplicate dat 1 million times, i.e. do
from operator import itemgetter

dat = dat * 1_000_000
dat.sort(key=itemgetter('id1', 'id2'))
and run the same benchmark again, groupby with itemgetter is the runaway winner:
3.91 s ± 320 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
5.38 s ± 251 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
1.77 s ± 128 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
3.53 s ± 199 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
15.2 s ± 831 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Ran on Python 3.9.7 (64-bit).
This benchmark somewhat favors groupby, since there are very few groups when we duplicate an existing small list of dicts. If we randomize the sizes of the "group"s, groupby + itemgetter still beats all the others, but the difference is not as stark.
Just for fun, a purely itertools solution (no use of collections, or otherwise of any intermediate containers that must be built and updated incrementally, if the list is already in key order; it does require a pre-sort if you can't guarantee it's already sorted so that unique id pairs are grouped together):
# At top of file
from itertools import groupby

# Also at top of file; not strictly necessary, but I find it's nicer to make cheap getters
# with self-documenting names
from operator import itemgetter

get_ids = itemgetter('id1', 'id2')
get_value = itemgetter('value')

# On each use:
dat.sort(key=get_ids)  # Not needed if data guaranteed grouped by unique id1/id2 pairs as in example
dat = [{'id1': id1, 'id2': id2, 'value': sum(map(get_value, group))}
       for (id1, id2), group in groupby(dat, key=get_ids)]

# If sorting is needed, you can optionally one-line it, though the result is rather
# overly dense (I don't recommend it):
dat = [{'id1': id1, 'id2': id2, 'value': sum(map(get_value, group))}
       for (id1, id2), group in groupby(sorted(dat, key=get_ids), key=get_ids)]
Personally, I'd generally use Counter or defaultdict(int) as shown in the other answers, as they get O(n) performance even with unsorted data (groupby is O(n), but if you need to sort first, the sorting is O(n log n)). Basically the only time this even has a theoretical advantage is when the data is already sorted and you value using a one-liner (excluding imports and the one-time setup cost of making the itemgetters); in practice, itertools.groupby has sufficient overhead that it still typically loses to one or both of collections.Counter / collections.defaultdict(int), especially when using collections.Counter in its optimized modes for counting iterables of things to count (those don't apply here, but are worth knowing about).
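For reference, the optimized counting mode mentioned above looks like the sketch below: passing an iterable straight to Counter() counts element occurrences in optimized C code, much faster than a Python-level cnt[key] += 1 loop. It counts occurrences rather than summing a value field, so it doesn't answer this question directly.

```python
from collections import Counter

# The fast path: Counter(iterable) counts elements in C
letters = Counter("mississippi")
print(letters["s"])  # 4

# The same fast path can count (id1, id2) pairs -- occurrences, not summed values
dat = [{"id1": 1, "id2": 2, "value": 1},
       {"id1": 1, "id2": 2, "value": 2}]
pair_counts = Counter((d["id1"], d["id2"]) for d in dat)
print(pair_counts[(1, 2)])  # 2
```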