简体   繁体   中英

Summing duplicates in a list of dictionaries by a compound key using itertools

I have a sorted list of dictionaries like so:

dat = [
      {"id1": 1, "id2": 2, "value": 1},
      {"id1": 1, "id2": 2, "value": 2},
      {"id1": 2, "id2": 2, "value": 2},
      {"id1": 2, "id2": 3, "value": 1},
      {"id1": 3, "id2": 3, "value": 1},
      {"id1": 3, "id2": 4, "value": 1},
      {"id1": 3, "id2": 4, "value": 1},
      {"id1": 3, "id2": 4, "value": 1},
      {"id1": 3, "id2": 4, "value": 1},
      ]

This is effectively (id1, id2, value) tuples, but where there are duplicates. I would like to deduplicate these by summing the values where both ids are equal, leaving me with unique (id1, id2) pairs where the new value is the sum of the dupes.

That is, from above, the desired output is:

dat =[
     {'id1': 1, 'id2': 2, 'value': 3},
     {'id1': 2, 'id2': 2, 'value': 2},
     {'id1': 2, 'id2': 3, 'value': 1},
     {'id1': 3, 'id2': 3, 'value': 1},
     {'id1': 3, 'id2': 4, 'value': 4}
     ]

Assume the list is millions with lots of duplicates. What's the most efficient way to do this using itertools or funcy (versus say, using pandas)?

You can start with collections.Counter and use the += operator, the convenient part of the Counter is that += assumes zero on inexisting keys.

dat = [
      {"id1": 1, "id2": 2, "value": 1},
      {"id1": 1, "id2": 2, "value": 2},
      {"id1": 2, "id2": 2, "value": 2},
      {"id1": 2, "id2": 3, "value": 1},
      {"id1": 3, "id2": 3, "value": 1},
      {"id1": 3, "id2": 4, "value": 1},
      {"id1": 3, "id2": 4, "value": 1},
      {"id1": 3, "id2": 4, "value": 1},
      {"id1": 3, "id2": 4, "value": 1},
      ]

from collections import Counter
cnt = Counter()
for item in dat:
  cnt[item["id1"], item["id2"]] += item["value"]

[{'id1':id1, 'id2': id2, 'value':v}for (id1, id2), v in cnt.items()]

Giving

[{'id1': 1, 'id2': 2, 'value': 3},
 {'id1': 2, 'id2': 2, 'value': 2},
 {'id1': 2, 'id2': 3, 'value': 1},
 {'id1': 3, 'id2': 3, 'value': 1},
 {'id1': 3, 'id2': 4, 'value': 4}]

We could use collections.defaultdict as well:

from collections import defaultdict
tmp = defaultdict(int)
for d in dat:
    tmp[d['id1'], d['id2']] += d['value']
out = [{'id1':id1, 'id2':id2, 'value':v} for (id1, id2), v in tmp.items()]

or (assuming the ids are sorted), itertools.groupby :

from itertools import groupby
out = [{'id1': k1, 'id2': k2, 'value': sum(d['value'] for d in g)} for (k1,k2), g in groupby(dat, lambda x: (x['id1'], x['id2']))]

or groupby + sum + to_dict in pandas :

out = pd.DataFrame(dat).groupby(['id1','id2'], as_index=False)['value'].sum().to_dict('records')

Output:

[{'id1': 1, 'id2': 2, 'value': 3},
 {'id1': 2, 'id2': 2, 'value': 2},
 {'id1': 2, 'id2': 3, 'value': 1},
 {'id1': 3, 'id2': 3, 'value': 1},
 {'id1': 3, 'id2': 4, 'value': 4}]


A basic benchmark on the provided data says groupby using itemgetter (as suggested by @ShadowRanger) is the fastest:

  1. defaultdict: 6.57 µs ± 491 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
  2. Counter ( Bob ): 9.56 µs ± 1.47 µs per loop (mean ± std. dev. of 7 runs, 100000 loops each)
  3. groupby + itemgetter ( ShadowRanger ): 6.01 µs ± 182 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
  4. groupby + lambda: 9.02 µs ± 598 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
  5. pandas: 3.81 ms ± 68.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Now if we duplicate dat 1mil times, ie Do

dat = dat*1_000_000
dat.sort(key=itemgetter('id1', 'id2'))

and do the same benchmark again, groupby with itemgetter is the runaway winner:

  1. 3.91 s ± 320 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
  2. 5.38 s ± 251 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
  3. 1.77 s ± 128 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
  4. 3.53 s ± 199 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
  5. 15.2 s ± 831 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

ran on Python 3.9.7 (64bit) .

This benchmark somewhat favors groupby since there are very few groups when we duplicate an existing small list of dicts. If create randomize the size of "group"s, groupby + itemgetter still beats all but the difference is not as stark.

Just for fun, a purely itertools solution (no use of collections or otherwise using any intermediate containers that must be built and updated incrementally if the list is already in key order, though it requires a pre-sort if you can't guaranteed it's already sorted to group unique id pairs together):

# At top of file
from itertools import groupby

# Also at top of file; not strictly necessary, but I find it's nicer to make cheap getters
# with self-documenting names
from operator import itemgetter
get_ids = itemgetter('id1', 'id2')
get_value = itemgetter('value')

# On each use:
dat.sort(key=get_ids)  # Not needed if data guaranteed grouped by unique id1/id2 pairs as in example

dat = [{'id1': id1, 'id2': id2, 'value': sum(map(get_value, group))}
       for (id1, id2), group in groupby(dat, key=get_ids)]

# If sorting needed, you can optionally one-line as the rather overly dense (I don't recommend it):
dat = [{'id1': id1, 'id2': id2, 'value': sum(map(get_value, group))}
       for (id1, id2), group in groupby(sorted(dat, key=get_ids), key=get_ids)]

Personally, I'd generally use Counter or defaultdict(int) as shown in the other answers, as they get O(n) performance even with unsorted data ( groupby is O(n) , but if you need to sort first, the sorting is O(n log n) ). Basically the only time this even has a theoretical advantage is when the data is already sorted and you value using a one-liner (excluding imports and one-time setup cost to make itemgetter s); in practice, itertools.groupby has sufficient overhead that it still typically loses to one or both of collections.Counter / collections.defaultdict(int) , especially when using collections.Counter in its optimized modes for counting iterables of things to count (that don't apply here, but are worth knowing about).

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM