简体   繁体   中英

How to find repeats or duplicates in nested dictionary?

I have a nested dictionary and I'm trying to find duplicates within in. For example, if I have:

dictionary = {'hello': 3 , 'world':{'this': 5 , 'is':{'a': 3, 'dict': None}}}

The return value would be something like:

True

because this dictionary contains duplicates.

I was able to do this quite easily with a regular dictionary, and I thought this would work well with this case too:

dictionary = {'hello': 3 , 'world':{'this': 5 , 'is':{'a': 3, 'dict': None}}}
rev_dictionary = {}

for key, value in dictionary.items():
    rev_dictionary.setdefault(value, set()).add(key)
    print(rev_dictionary)

for key,values in dictionary.items():
    if len(values) > 1:
        values = True
    else:
        values = False

which throws the following error:

TypeError: unhashable type: 'dict'

How can I get this working?

Thanks for the help!

Note: I'd prefer a solution without using libraries if possible

I am assuming you are defining duplicates by value and not by keys. In that case, you can flatten the nested dict using ( mentioned here )

def flatten(d):
    out = {}
    for key, val in d.items():
        if isinstance(val, dict):
            val = [val]
        if isinstance(val, list):
            for subdict in val:
                deeper = flatten(subdict).items()
                out.update({key + '_' + key2: val2 for key2, val2 in deeper})
        else:
            out[key] = val
    return out

and then check for the condition

v = flatten(d).values()
len(set(v))!=len(v)

results in True

I wrote a simple solution:

dictionary = {'hello': 3 , 'world':{'this': 5 , 'is':{'a': 3, 'dict': None}}}

def get_dups(a, values=None):
    if values is None: values = []
    if (a in values): return True
    values.append(a)
    if type(a) == dict:
        for i in a.values():
            if (get_dups(i, values=values)):
                return True
    return False

print(get_dups(dictionary))

How it works

We start by saving every value in a list, which we will pass into the function. Each run we check whether our current value is in that list, and return True once there is a duplicate.

if (a in values): return True

Next we just loop through the values and run get_dups on them if the current index is also a dictionary.

I think all you need is to flatten the dict before passing to your duplication-detection pipeline:

import pandas as pd

def flatten_dict(d):
    df = pd.io.json.json_normalize(d, sep='_')
    return df.to_dict(orient='records')[0]

dictionary = {'hello': 3 , 'world':{'this': 5 , 'is':{'a': 3, 'dict': None}}}

dictionary = flatten_dict(dictionary)
print('flattend')
print(dictionary)

rev_dictionary = {}

for key, value in dictionary.items():
    rev_dictionary.setdefault(value, set()).add(key)

print('reversed')
print(rev_dictionary)

is_duplicate = False
for key, values in rev_dictionary.items():
    if len(values) > 1:
        is_duplicate = True
        break

print('is duplicate?', is_duplicate)

The result:

flattend
{'hello': 3, 'world_is_a': 3, 'world_is_dict': None, 'world_this': 5}
reversed
{3: {'hello', 'world_is_a'}, None: {'world_is_dict'}, 5: {'world_this'}}
is duplicate? True

Code for flattening a dict borrowed from: Flatten nested Python dictionaries, compressing keys .

You can recursively add item values of sub-dicts to a set, and if any item value is already "seen" in the set, raise an exception so that the wrapper can return True to indicate that a dupe is found:

def has_dupes(d):
    def values(d):
        seen = set()
        for k, v in d.items():
            if isinstance(v, dict):
                s = values(v)
                if seen & s:
                    raise RuntimeError()
                seen.update(s)
            else:
                if v in seen:
                    raise RuntimeError()
                seen.add(v)
        return seen
    try:
        values(d)
    except RuntimeError:
        return True
    return False

so that given your sample input, has_dupes(dictionary) would return: True

Convert the nested dictionary into nested lists of their values:

def nested_values(v):
    return map(nested_values, v.values()) if isinstance(v, dict) else v

Then flatten the nested lists into one list of all values in the dictionaries, and then check the flattened list of values for duplicates:

from itertools import chain

def is_duplicated_value(d):
    flat = list(chain.from_iterable(nested_values(d)))
    return len(flat) != len(set(flat))

Test:

print is_duplicated_value( {1:'a', 2:'b', 3:{1:'c', 2:'a'}} )
print is_duplicated_value( {1:'a', 2:'b', 3:{1:'c', 2:'d'}} )

Outputs:

True
False

Depending on your use and size of dictionaries etc you may want to recast these steps into a recursive function that adds each value to a set checking if each value is in the set before adding and returning True immediately or False if the dictionary is exhausted.

class Duplicated(ValueError): pass

def is_dup(d):
    values = set()
    def add(v):
        if isinstance(v, dict):
            map(add, v.values())
        else:
            if v in values:
                raise Duplicated
            else:
                values.add(v)
    try:
        add(d)
        return False
    except Duplicated:
        return True

Test:

print is_dup( {1:'a', 2:'b', 3:{1:'c', 2:'a'}} )
print is_dup( {1:'a', 2:'b', 3:{1:'c', 2:'d'}} )

Outputs:

True
False

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM