Summing duplicates in a list of dictionaries by a compound key using itertools
I have a sorted list of dictionaries like so:
dat = [
    {"id1": 1, "id2": 2, "value": 1},
    {"id1": 1, "id2": 2, "value": 2},
    {"id1": 2, "id2": 2, "value": 2},
    {"id1": 2, "id2": 3, "value": 1},
    {"id1": 3, "id2": 3, "value": 1},
    {"id1": 3, "id2": 4, "value": 1},
    {"id1": 3, "id2": 4, "value": 1},
    {"id1": 3, "id2": 4, "value": 1},
    {"id1": 3, "id2": 4, "value": 1},
]
This is effectively a list of (id1, id2, value) tuples, but with duplicates. I would like to deduplicate these by summing the values where both ids are equal, leaving me with unique (id1, id2) pairs where the new value is the sum of the dupes.
That is, from above, the desired output is:
dat = [
    {'id1': 1, 'id2': 2, 'value': 3},
    {'id1': 2, 'id2': 2, 'value': 2},
    {'id1': 2, 'id2': 3, 'value': 1},
    {'id1': 3, 'id2': 3, 'value': 1},
    {'id1': 3, 'id2': 4, 'value': 4}
]
Assume the list is millions of items long with lots of duplicates. What's the most efficient way to do this using itertools or funcy (versus, say, using pandas)?
You can start with collections.Counter and use the += operator; the convenient part of Counter is that += assumes zero for nonexistent keys.
dat = [
    {"id1": 1, "id2": 2, "value": 1},
    {"id1": 1, "id2": 2, "value": 2},
    {"id1": 2, "id2": 2, "value": 2},
    {"id1": 2, "id2": 3, "value": 1},
    {"id1": 3, "id2": 3, "value": 1},
    {"id1": 3, "id2": 4, "value": 1},
    {"id1": 3, "id2": 4, "value": 1},
    {"id1": 3, "id2": 4, "value": 1},
    {"id1": 3, "id2": 4, "value": 1},
]

from collections import Counter

cnt = Counter()
for item in dat:
    cnt[item["id1"], item["id2"]] += item["value"]

[{'id1': id1, 'id2': id2, 'value': v} for (id1, id2), v in cnt.items()]
Giving:
[{'id1': 1, 'id2': 2, 'value': 3},
{'id1': 2, 'id2': 2, 'value': 2},
{'id1': 2, 'id2': 3, 'value': 1},
{'id1': 3, 'id2': 3, 'value': 1},
{'id1': 3, 'id2': 4, 'value': 4}]
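As a quick standalone illustration of the zero-default behavior (not part of the original answer): += works on a key that was never set, and even a plain lookup of an absent key returns 0 without inserting it.

```python
from collections import Counter

cnt = Counter()
cnt["missing"] += 5        # no KeyError: absent keys are treated as 0
print(cnt["missing"])      # 5
print(cnt["never_set"])    # 0 -- lookup of an absent key returns 0
print("never_set" in cnt)  # False -- the lookup did not insert the key
```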
We could use collections.defaultdict as well:
from collections import defaultdict

tmp = defaultdict(int)
for d in dat:
    tmp[d['id1'], d['id2']] += d['value']

out = [{'id1': id1, 'id2': id2, 'value': v} for (id1, id2), v in tmp.items()]
or (assuming the ids are sorted), itertools.groupby:
from itertools import groupby

out = [{'id1': k1, 'id2': k2, 'value': sum(d['value'] for d in g)}
       for (k1, k2), g in groupby(dat, lambda x: (x['id1'], x['id2']))]
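The "assuming the ids are sorted" caveat matters: groupby only coalesces runs of consecutive equal keys, so unsorted input silently yields duplicate groups rather than raising an error. A small sketch of the pitfall, using hypothetical data (not the data from the question):

```python
from itertools import groupby

unsorted = [{"id": 1, "value": 1}, {"id": 2, "value": 2}, {"id": 1, "value": 3}]

# groupby only merges *consecutive* equal keys, so id 1 shows up twice here
groups = [(k, sum(d["value"] for d in g))
          for k, g in groupby(unsorted, key=lambda d: d["id"])]
print(groups)  # [(1, 1), (2, 2), (1, 3)] -- id 1 was not merged

# sorting first restores one group per key
ordered = sorted(unsorted, key=lambda d: d["id"])
groups = [(k, sum(d["value"] for d in g))
          for k, g in groupby(ordered, key=lambda d: d["id"])]
print(groups)  # [(1, 4), (2, 2)]
```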
or groupby + sum + to_dict in pandas:
import pandas as pd

out = pd.DataFrame(dat).groupby(['id1', 'id2'], as_index=False)['value'].sum().to_dict('records')
Output:
[{'id1': 1, 'id2': 2, 'value': 3},
{'id1': 2, 'id2': 2, 'value': 2},
{'id1': 2, 'id2': 3, 'value': 1},
{'id1': 3, 'id2': 3, 'value': 1},
{'id1': 3, 'id2': 4, 'value': 4}]
A basic benchmark on the provided data says groupby using itemgetter (as suggested by @ShadowRanger) is the fastest:
6.57 µs ± 491 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
9.56 µs ± 1.47 µs per loop (mean ± std. dev. of 7 runs, 100000 loops each)
6.01 µs ± 182 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
9.02 µs ± 598 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
3.81 ms ± 68.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
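The timings above come from IPython's %timeit; a rough plain-timeit harness in the same spirit looks like the sketch below. The two functions are stand-ins for the Counter and groupby snippets being compared, not the exact harness used for the numbers above.

```python
import timeit
from collections import Counter
from itertools import groupby
from operator import itemgetter

dat = [
    {"id1": 1, "id2": 2, "value": 1},
    {"id1": 1, "id2": 2, "value": 2},
    {"id1": 2, "id2": 2, "value": 2},
]

def with_counter(data):
    # dict-based accumulation: works on unsorted data
    cnt = Counter()
    for item in data:
        cnt[item["id1"], item["id2"]] += item["value"]
    return [{"id1": a, "id2": b, "value": v} for (a, b), v in cnt.items()]

def with_groupby(data):
    # run-based accumulation: requires data grouped by (id1, id2)
    get_ids = itemgetter("id1", "id2")
    return [{"id1": a, "id2": b, "value": sum(d["value"] for d in g)}
            for (a, b), g in groupby(data, key=get_ids)]

for fn in (with_counter, with_groupby):
    t = timeit.timeit(lambda: fn(dat), number=10_000)
    print(f"{fn.__name__}: {t:.4f}s for 10k runs")
```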
Now if we duplicate dat 1 million times, i.e. do
from operator import itemgetter

dat = dat * 1_000_000
dat.sort(key=itemgetter('id1', 'id2'))
and run the same benchmark again, groupby with itemgetter is the runaway winner:
3.91 s ± 320 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
5.38 s ± 251 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
1.77 s ± 128 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
3.53 s ± 199 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
15.2 s ± 831 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Ran on Python 3.9.7 (64-bit).
This benchmark somewhat favors groupby, since there are very few groups when we duplicate an existing small list of dicts. If we randomize the sizes of the "group"s, groupby + itemgetter still beats all the others, but the difference is not as stark.
Just for fun, a purely itertools solution (no use of collections, or otherwise of any intermediate containers that must be built and updated incrementally, if the list is already in key order; it does require a pre-sort if you can't guarantee it's already sorted so that unique id pairs are grouped together):
# At top of file
from itertools import groupby

# Also at top of file; not strictly necessary, but I find it's nicer to make cheap getters
# with self-documenting names
from operator import itemgetter

get_ids = itemgetter('id1', 'id2')
get_value = itemgetter('value')

# On each use:
dat.sort(key=get_ids)  # Not needed if data guaranteed grouped by unique id1/id2 pairs as in example
dat = [{'id1': id1, 'id2': id2, 'value': sum(map(get_value, group))}
       for (id1, id2), group in groupby(dat, key=get_ids)]

# If sorting is needed, you can optionally one-line it, though the result is rather
# overly dense (I don't recommend it):
dat = [{'id1': id1, 'id2': id2, 'value': sum(map(get_value, group))}
       for (id1, id2), group in groupby(sorted(dat, key=get_ids), key=get_ids)]
Personally, I'd generally use Counter or defaultdict(int) as shown in the other answers, as they get O(n) performance even with unsorted data (groupby is O(n), but if you need to sort first, the sorting is O(n log n)). Basically the only time this even has a theoretical advantage is when the data is already sorted and you value using a one-liner (excluding imports and the one-time setup cost of making the itemgetters); in practice, itertools.groupby has sufficient overhead that it still typically loses to one or both of collections.Counter / collections.defaultdict(int), especially when using collections.Counter in its optimized modes for counting iterables of things to count (those don't apply here, but are worth knowing about).
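For reference, the optimized counting mode mentioned above looks like the sketch below: passing an iterable straight to Counter() counts element occurrences in optimized C code, much faster than a Python-level cnt[key] += 1 loop. It counts occurrences rather than summing a value field, so it doesn't answer this question directly.

```python
from collections import Counter

# The fast path: Counter(iterable) counts elements in C
letters = Counter("mississippi")
print(letters["s"])  # 4

# The same fast path can count (id1, id2) pairs -- occurrences, not summed values
dat = [{"id1": 1, "id2": 2, "value": 1},
       {"id1": 1, "id2": 2, "value": 2}]
pair_counts = Counter((d["id1"], d["id2"]) for d in dat)
print(pair_counts[(1, 2)])  # 2
```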