
Filter a list of dictionaries to remove duplicates within a key, based on another key

I have a list of dictionaries in Python 3.5.2 that I am attempting to "deduplicate". All of the dictionaries are unique, but there is a specific key I would like to deduplicate on, keeping the dictionary with the most non-null values.

For example, I have the following list of dictionaries:

d1 = {"id":"a", "foo":"bar", "baz":"bat"}
d2 = {"id":"b", "foo":"bar", "baz":None}
d3 = {"id":"a", "foo":"bar", "baz":None}
d4 = {"id":"b", "foo":"bar", "baz":"bat"}
l = [d1, d2, d3, d4]

I would like to filter l to just dictionaries with unique id keys, keeping the dictionary that has the fewest nulls. In this case the function should keep d1 and d4.

What I attempted was to create a new key/value pair for a "value count", like so:

for d in l:
    d['val_count'] = len(set([v for v in d.values() if v]))

Now what I am stuck on is how to filter my list of dicts down to unique ids, keeping for each id the dict whose val_count is greatest.

I am open to other approaches, but I am unable to use pandas for this project due to resource constraints.

Expected output:

l = [{"id":"a", "foo":"bar", "baz":"bat"},
 {"id":"b", "foo":"bar", "baz":"bat"}]

I would use groupby and just pick the first one from each group:

1) First sort your list by key (to create the groups) and by descending count of non-null values, so the dict with the fewest nulls comes first in each group (your stated goal):

>>> l2=sorted(l, key=lambda d: (d['id'], -sum(1 for v in d.values() if v))) 

2) Then group by id and take the first element of each iterator presented as d in the groupby on the sorted list:

>>> from itertools import groupby
>>> [next(d) for _,d in groupby(l2, key=lambda _d: _d['id'])]
[{'id': 'a', 'foo': 'bar', 'baz': 'bat'}, {'id': 'b', 'foo': 'bar', 'baz': 'bat'}]

If you want a 'tie breaker' that selects the first dict when several have the same null count, you can decorate the list with enumerate and use the original index as the last element of the sort key:

>>> l2=sorted(enumerate(l), key=lambda t: (t[1]['id'], -sum(1 for v in t[1].values() if v), t[0]))
>>> [next(d)[1] for _,d in groupby(l2, key=lambda t: t[1]['id'])]

I doubt that additional step is actually necessary, though, since Python's sort (and sorted) is stable: dicts that compare equal on id and null count keep their original list order. So use the first version unless you are sure you need the second.
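To illustrate the stability point, here is a small sketch. The `records` data and the `non_null` helper are made up for this example, and the helper counts values that are not None rather than truthy values. Two dicts tie on both id and non-null count, and the one that appears first in the input is the one that gets picked.

```python
from itertools import groupby

def non_null(d):
    # Hypothetical helper: number of values that are not None
    return sum(v is not None for v in d.values())

# The two records with id "a" tie on non-null count; "first" precedes "second"
records = [
    {"id": "a", "tag": "first", "baz": None},
    {"id": "b", "tag": "only", "baz": "bat"},
    {"id": "a", "tag": "second", "baz": None},
]

# sorted() is stable, so tied records keep their input order
ordered = sorted(records, key=lambda d: (d["id"], -non_null(d)))
winners = [next(group) for _, group in groupby(ordered, key=lambda d: d["id"])]

print([w["tag"] for w in winners])  # ['first', 'only']
```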

You can use max :

d1 = {"id":"a", "foo":"bar", "baz":"bat"}
d2 = {"id":"b", "foo":"bar", "baz":None}
d3 = {"id":"a", "foo":"bar", "baz":None}
d4 = {"id":"b", "foo":"bar", "baz":"bat"}
l = [d1, d2, d3, d4]
max_none = max(sum(c is None for c in i.values()) for i in l)
new_l = [i for i in l if sum(c is None for c in i.values()) < max_none]

Output:

[{'foo': 'bar', 'baz': 'bat', 'id': 'a'}, {'foo': 'bar', 'baz': 'bat', 'id': 'b'}]
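Note that the filter above compares every dict against a single global maximum, so it does not group by id, and it returns an empty list when all dicts have the same number of None values. A per-id variant of the same idea might look like the sketch below. The names `groups` and `non_none_count` are introduced here for illustration, and the output order follows first appearance of each id only on Python 3.7+, where dicts preserve insertion order.

```python
from collections import defaultdict

d1 = {"id": "a", "foo": "bar", "baz": "bat"}
d2 = {"id": "b", "foo": "bar", "baz": None}
d3 = {"id": "a", "foo": "bar", "baz": None}
d4 = {"id": "b", "foo": "bar", "baz": "bat"}
l = [d1, d2, d3, d4]

def non_none_count(d):
    # Number of values that are not None
    return sum(v is not None for v in d.values())

# Collect the dicts that share an id
groups = defaultdict(list)
for d in l:
    groups[d["id"]].append(d)

# max() returns the first maximal element, so ties keep the earliest dict
new_l = [max(group, key=non_none_count) for group in groups.values()]

print(new_l)
```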

If you are open to using a third-party library, you can sort by the number of None values and then feed the result into toolz.unique:

from toolz import unique
from operator import itemgetter

l_sorted = sorted(l, key=lambda x: sum(v is None for v in x.values()))
res = list(unique(l_sorted, key=itemgetter('id')))

[{'baz': 'bat', 'foo': 'bar', 'id': 'a'},
 {'baz': 'bat', 'foo': 'bar', 'id': 'b'}]

If you cannot use toolz, the source code is small enough to implement yourself.
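As a rough idea of what such a replacement could look like, here is a minimal sketch. It is not the toolz source, and `unique_by` is a name chosen for this example. It yields the first item seen for each key, which is the behaviour the sorting step relies on.

```python
from operator import itemgetter

def unique_by(seq, key):
    """Yield the first item seen for each distinct key(item)."""
    seen = set()
    for item in seq:
        k = key(item)
        if k not in seen:
            seen.add(k)
            yield item

d1 = {"id": "a", "foo": "bar", "baz": "bat"}
d2 = {"id": "b", "foo": "bar", "baz": None}
d3 = {"id": "a", "foo": "bar", "baz": None}
d4 = {"id": "b", "foo": "bar", "baz": "bat"}
l = [d1, d2, d3, d4]

# Fewest None values first, so the first dict seen per id is the best one
l_sorted = sorted(l, key=lambda x: sum(v is None for v in x.values()))
res = list(unique_by(l_sorted, key=itemgetter("id")))

print(res)
```

The keys must be hashable for the `seen` set to work, which holds for the string ids used here.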


Performance benchmarking

I have only included solutions which give exactly one result per id. Many solutions do not cater for a duplicate dictionary.

l = [d1, d2, d3, d4]*1000

%timeit dawg(l)  # 11.4 ms
%timeit jpp(l)   # 7.91 ms
%timeit tsw(l)   # 4.23 s

from operator import itemgetter
from itertools import groupby
from toolz import unique

def dawg(l):
    l2=sorted(enumerate(l), key=lambda t: (t[1]['id'], -sum(1 for v in t[1].values() if v), t[0]))
    return [next(d)[1] for _,d in groupby(l2, key=lambda t: t[1]['id'])]

def jpp(l):
    l_sorted = sorted(l, key=lambda x: sum(v is None for v in x.values()))
    return list(unique(l_sorted, key=itemgetter('id')))

def tsw(l):
    for d in l:
        d['val_count'] = len(set([v for v in d.values() if v]))
    new = [d for d in l if d['val_count'] == max([d_other['val_count'] for d_other in l if d_other['id'] == d['id']])]
    return [x for i, x in enumerate(new) if x['id'] not in {y['id'] for y in new[:i]}]
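One way to check the "exactly one result per id" condition before timing a candidate is sketched below. The helper `one_result_per_id` is hypothetical and was not part of the benchmark itself.

```python
from collections import Counter

def one_result_per_id(result, source, id_key="id"):
    """Return True if result holds exactly one dict for every id in source."""
    counts = Counter(d[id_key] for d in result)
    expected_ids = {d[id_key] for d in source}
    return set(counts) == expected_ids and all(n == 1 for n in counts.values())

d1 = {"id": "a", "foo": "bar", "baz": "bat"}
d2 = {"id": "b", "foo": "bar", "baz": None}
d3 = {"id": "a", "foo": "bar", "baz": None}
d4 = {"id": "b", "foo": "bar", "baz": "bat"}
source = [d1, d2, d3, d4] * 2  # repeated input, so exact duplicates occur

print(one_result_per_id([d1, d4], source))      # True
print(one_result_per_id([d1, d4, d1], source))  # False: id "a" appears twice
```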

Here's one way using a list comprehension which uses the 'val_count' values which you've already calculated:

new = [d for d in l if d['val_count'] == max([d_other['val_count'] for d_other in l if d_other['id'] == d['id']])]

Giving:

[{'baz': 'bat', 'foo': 'bar', 'id': 'a', 'val_count': 3},
 {'baz': 'bat', 'foo': 'bar', 'id': 'b', 'val_count': 3}]

This works by comparing the current dictionary's 'val_count' to the maximum 'val_count' of all dictionaries with the same 'id'. Note that in the case of ties, all dictionaries which have the max 'val_count' are kept.

The following line should handle ties, keeping the first instance of a certain 'id' only:

final = [x for i, x in enumerate(new) if x['id'] not in {y['id'] for y in new[:i]}]

There will almost certainly be more efficient ways to solve this problem, but this should at least work and may be suitable for your needs depending on the size of your dataset.

I would do it like this:

num = [list(x.values()).count(None) for x in l]
ls = [x for _,x in sorted(zip(num, l), key=lambda z: z[0])]

Then keep as many values as you want from the sorted list (ls).

For example, to keep only those dictionaries with the highest number of non-None values (all dictionaries that share that number), you can do this:

num = [list(x.values()).count(None) for x in l]
ls, ns = zip(*[(x, d) for d, x in sorted(zip(num, l), key=lambda z: z[0])])
top_l = ls[:ns.count(ns[0])]

EDIT: Based on @jpp's comment, I have updated my code to handle duplicate id keys. Here is the updated code:

def agn(l):
    num = [list(x.values()).count(None) for x in l]
    ls, ns = zip(*[(x, d) for d, x in sorted(zip(num, l), key=lambda z: z[0])])
    top_l = ls[:ns.count(ns[0])]
    return list(dict((d['id'], d) for d in top_l).values())

Let's also add a timing comparison using the same definitions and setup as in @jpp's answer:

In [113]: %timeit tsw(l)
3.9 s ± 60.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [114]: %timeit dawg(l)
7.48 ms ± 191 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [115]: %timeit jpp(l)
5.83 ms ± 104 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [116]: %timeit agn(l)
4.58 ms ± 86.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

@cdc200, you can try the code below. It uses a dictionary keyed by id, so that only one entry is kept per id.

Note » A dictionary is defined as an unordered collection of data items with unique keys.

I have used OrderedDict() in place of dict() to preserve the order in which keys are inserted. See the article "OrderedDict in Python" on GeeksforGeeks.

import json
from collections import OrderedDict

d1 = {"id":"a", "foo":"bar", "baz":"bat"}
d2 = {"id":"b", "foo":"bar", "baz":None}
d3 = {"id":"a", "foo":"bar", "baz":None}
d4 = {"id":"b", "foo":"bar", "baz":"bat"}
l = [d1, d2, d3, d4]

d = OrderedDict()

for item in l:
    key = item["id"]
    if key not in d:
        # First dictionary seen for this id
        d[key] = item
    else:
        # Count the None values in the new and in the stored dictionary
        nones1 = sum(v is None for v in item.values())
        nones2 = sum(v is None for v in d[key].values())

        # Replace the stored dictionary only if the new one has fewer Nones
        if nones2 > nones1:
            d[key] = item

l = list(d.values())

print(l)

"""
[{'foo': 'bar', 'id': 'a', 'baz': 'bat'}, {'foo': 'bar', 'id': 'b', 'baz': 'bat'}]
"""

# Pretty printing the above list of dictionaries
print(json.dumps(l, indent=4))

"""
[
    {
        "foo": "bar",
        "id": "a",
        "baz": "bat"
    },
    {
        "foo": "bar",
        "id": "b",
        "baz": "bat"
    }
]
"""

Thanks.
