Summing duplicates in a list of dictionaries by a compound key using itertools

I have a sorted list of dictionaries like so:

dat = [
      {"id1": 1, "id2": 2, "value": 1},
      {"id1": 1, "id2": 2, "value": 2},
      {"id1": 2, "id2": 2, "value": 2},
      {"id1": 2, "id2": 3, "value": 1},
      {"id1": 3, "id2": 3, "value": 1},
      {"id1": 3, "id2": 4, "value": 1},
      {"id1": 3, "id2": 4, "value": 1},
      {"id1": 3, "id2": 4, "value": 1},
      {"id1": 3, "id2": 4, "value": 1},
      ]

This is effectively a list of (id1, id2, value) tuples, but with duplicates. I would like to deduplicate these by summing the values where both ids are equal, leaving me with unique (id1, id2) pairs where the new value is the sum of the duplicates.

That is, from the above, the desired output is:

dat = [
     {'id1': 1, 'id2': 2, 'value': 3},
     {'id1': 2, 'id2': 2, 'value': 2},
     {'id1': 2, 'id2': 3, 'value': 1},
     {'id1': 3, 'id2': 3, 'value': 1},
     {'id1': 3, 'id2': 4, 'value': 4}
     ]

Assume the real list contains millions of entries, with lots of duplicates. What's the most efficient way to do this using itertools or funcy (versus, say, pandas)?

You can start with collections.Counter and use the += operator; the convenient part of Counter is that += assumes zero for missing keys.

dat = [
      {"id1": 1, "id2": 2, "value": 1},
      {"id1": 1, "id2": 2, "value": 2},
      {"id1": 2, "id2": 2, "value": 2},
      {"id1": 2, "id2": 3, "value": 1},
      {"id1": 3, "id2": 3, "value": 1},
      {"id1": 3, "id2": 4, "value": 1},
      {"id1": 3, "id2": 4, "value": 1},
      {"id1": 3, "id2": 4, "value": 1},
      {"id1": 3, "id2": 4, "value": 1},
      ]

from collections import Counter

cnt = Counter()
for item in dat:
    cnt[item["id1"], item["id2"]] += item["value"]

[{'id1': id1, 'id2': id2, 'value': v} for (id1, id2), v in cnt.items()]

Giving:

[{'id1': 1, 'id2': 2, 'value': 3},
 {'id1': 2, 'id2': 2, 'value': 2},
 {'id1': 2, 'id2': 3, 'value': 1},
 {'id1': 3, 'id2': 3, 'value': 1},
 {'id1': 3, 'id2': 4, 'value': 4}]

We could use collections.defaultdict as well:

from collections import defaultdict
tmp = defaultdict(int)
for d in dat:
    tmp[d['id1'], d['id2']] += d['value']
out = [{'id1': id1, 'id2': id2, 'value': v} for (id1, id2), v in tmp.items()]

or (assuming the ids are sorted), itertools.groupby:

from itertools import groupby

out = [{'id1': k1, 'id2': k2, 'value': sum(d['value'] for d in g)}
       for (k1, k2), g in groupby(dat, lambda x: (x['id1'], x['id2']))]

or groupby + sum + to_dict in pandas:

import pandas as pd

out = pd.DataFrame(dat).groupby(['id1', 'id2'], as_index=False)['value'].sum().to_dict('records')

Output:

[{'id1': 1, 'id2': 2, 'value': 3},
 {'id1': 2, 'id2': 2, 'value': 2},
 {'id1': 2, 'id2': 3, 'value': 1},
 {'id1': 3, 'id2': 3, 'value': 1},
 {'id1': 3, 'id2': 4, 'value': 4}]


A basic benchmark on the provided data shows that groupby using itemgetter (as suggested by @ShadowRanger) is the fastest (a timeit sketch for reproducing such timings follows the list):

  1. defaultdict: 6.57 µs ± 491 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
  2. Counter (Bob): 9.56 µs ± 1.47 µs per loop (mean ± std. dev. of 7 runs, 100000 loops each)
  3. groupby + itemgetter (ShadowRanger): 6.01 µs ± 182 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
  4. groupby + lambda: 9.02 µs ± 598 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
  5. pandas: 3.81 ms ± 68.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
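
For reference, a minimal sketch of how such timings can be collected with timeit; the wrapper function below is illustrative, not the original benchmark code, and each of the five candidates would be wrapped the same way:

import timeit
from collections import defaultdict

# One candidate approach wrapped in a function so it can be timed repeatedly.
def with_defaultdict(data):
    tmp = defaultdict(int)
    for d in data:
        tmp[d['id1'], d['id2']] += d['value']
    return [{'id1': i1, 'id2': i2, 'value': v} for (i1, i2), v in tmp.items()]

# 7 runs of 100,000 loops each, matching the figures quoted above;
# `dat` is the small sample list defined earlier.
runs = timeit.repeat(lambda: with_defaultdict(dat), repeat=7, number=100_000)
print(f"{min(runs) / 100_000 * 1e6:.2f} µs per loop (best of 7 runs)")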

Now if we duplicate dat a million times, i.e. do

from operator import itemgetter

dat = dat * 1_000_000
dat.sort(key=itemgetter('id1', 'id2'))

and run the same benchmark again, groupby with itemgetter is the runaway winner:

  1. 3.91 s ± 320 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
  2. 5.38 s ± 251 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
  3. 1.77 s ± 128 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
  4. 3.53 s ± 199 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
  5. 15.2 s ± 831 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

All timings were run on Python 3.9.7 (64-bit).

This benchmark somewhat favors groupby, since there are very few distinct groups when we duplicate an existing small list of dicts. If we instead randomize the sizes of the groups, groupby + itemgetter still beats all the others, but the difference is not as stark (a sketch of such a generator is below).
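
The original generator isn't shown; a hedged sketch of one way to build data with randomized group sizes (the function name and parameters here are assumptions) could be:

import random

# Hypothetical generator: each (id1, id2) pair gets a random number of
# duplicate rows, so group sizes vary instead of every group being the
# same size.
def make_random_groups(n_pairs, max_dupes=10, seed=0):
    rng = random.Random(seed)
    rows = []
    for i in range(n_pairs):
        for _ in range(rng.randint(1, max_dupes)):
            rows.append({"id1": i, "id2": i + 1, "value": rng.randint(1, 5)})
    return rows  # built in key order, so no extra sort is needed for groupby

dat = make_random_groups(200_000)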

Just for fun, a purely itertools solution (no use of collections, or any other intermediate container that must be built and updated incrementally). It needs no extra work if the list is already in key order, though it requires a pre-sort if you can't guarantee it's already sorted so that identical id pairs are grouped together:

# At top of file
from itertools import groupby

# Also at top of file; not strictly necessary, but I find it's nicer to make cheap getters
# with self-documenting names
from operator import itemgetter
get_ids = itemgetter('id1', 'id2')
get_value = itemgetter('value')

# On each use:
dat.sort(key=get_ids)  # Not needed if the data is already grouped by unique id1/id2 pairs, as in the example

dat = [{'id1': id1, 'id2': id2, 'value': sum(map(get_value, group))}
       for (id1, id2), group in groupby(dat, key=get_ids)]

# If sorting is needed, you can optionally one-line it, though the result is rather dense (I don't recommend it):
dat = [{'id1': id1, 'id2': id2, 'value': sum(map(get_value, group))}
       for (id1, id2), group in groupby(sorted(dat, key=get_ids), key=get_ids)]

Personally, I'd generally use Counter or defaultdict(int) as shown in the other answers, as they get O(n) performance even with unsorted data (groupby is O(n), but if you need to sort first, the sort is O(n log n)). Basically, the only time this approach has even a theoretical advantage is when the data is already sorted and you value a one-liner (excluding imports and the one-time setup cost of making the itemgetters); in practice, itertools.groupby has enough overhead that it still typically loses to one or both of collections.Counter / collections.defaultdict(int), especially when collections.Counter is used in its optimized modes for counting iterables of things to count (those don't apply here, but are worth knowing about; see the sketch below).
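
For context, a minimal sketch of the optimized Counter counting mode mentioned above: feeding an iterable of hashable keys directly to Counter() (or Counter.update()), which takes a C-accelerated path in CPython. It counts rows per key pair rather than summing 'value', so it solves a related but different problem:

from collections import Counter
from operator import itemgetter

get_ids = itemgetter('id1', 'id2')

# Fast path: Counter consumes the iterable of (id1, id2) keys directly.
pair_counts = Counter(map(get_ids, dat))

# Note: pair_counts holds how many rows share each (id1, id2) pair; it
# does NOT sum the 'value' field, which is why this mode doesn't apply
# to the question as asked.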
