How to uniquify a list of dicts based on percentage similarity of a value in the dicts

Question

I have a list of dicts with a property whose value may duplicate, or be similar to, that of other dicts in the list. I'd like to use a similarity comparison function to uniquify this list. If several dicts have values for the key "greeting" that are similar to each other within a certain percentage, only one of them should be kept.

For example, in this list I want only one of the 'hello world' variants to remain:

list = [{"greeting":"HELLO WORLD!", ...}, {"greeting":"Hello Mars", ...}, {"greeting":"Hello World!!!", ...}, {"greeting":"hello world", ...}]

After uniquifying, the result would be:

list = [{"greeting":"HELLO WORLD!", ...}, {"greeting":"Hello Mars", ...}]

All other dicts with similar greetings should be removed from the list. It doesn't matter which of the similar dicts is kept.

Here is a similarity function by Nadia Alramli:

import difflib

def similar(seq1, seq2):
    return difflib.SequenceMatcher(a=seq1.lower(), b=seq2.lower()).ratio() > 0.9
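
To see how the 0.9 threshold treats the sample greetings, the raw scores can be printed with a short sketch. The helper name `ratio` is made up for this illustration; it performs the same comparison as `similar` but returns the score instead of a boolean.

```python
import difflib

def ratio(seq1, seq2):
    # Hypothetical helper: same case-insensitive comparison as similar(),
    # but returning the raw SequenceMatcher score
    return difflib.SequenceMatcher(a=seq1.lower(), b=seq2.lower()).ratio()

greetings = ["HELLO WORLD!", "Hello Mars", "Hello World!!!", "hello world"]

# Compare the first greeting against each of the others
for other in greetings[1:]:
    print("%s vs %s: %.3f" % (greetings[0], other, ratio(greetings[0], other)))

# Compare the two remaining 'hello world' variants with each other
print("%.3f" % ratio(greetings[2], greetings[3]))
```

With these strings, "HELLO WORLD!" scores above 0.9 against both "Hello World!!!" and "hello world", but those two score only 0.88 against each other (11 matching characters out of 25 in total). Similarity is therefore not transitive, so which dicts survive can depend on the order in which they are compared.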

Answer 1

Using your function that determines similarity, you can do this:

import difflib

def similar(seq1, seq2):
    return difflib.SequenceMatcher(a=seq1.lower(), b=seq2.lower()).ratio() > 0.9

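def unique(mylist, keys):
    # temp always holds the dicts that come after d in mylist
    temp = mylist[:]
    for d in mylist:
        temp.pop(0)
        for i in keys:
            if i not in d:
                continue
            # drop key i from every later dict whose value is similar to d's
            for d2 in temp:
                if i in d2 and similar(d[i], d2[i]):
                    d2.pop(i)
    return mylist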

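Note that this modifies your dictionaries in place. Similar entries lose the key instead of being removed from the list, so empty dicts can be left behind: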

mylist = [{"greeting":"HELLO WORLD!"}, {"greeting":"Hello Mars"}, {"greeting":"Hello World!!!"}, {"greeting":"hello world"}]
unique(mylist, ['greeting'])

print(mylist)

Output:

[{'greeting': 'HELLO WORLD!'}, {'greeting': 'Hello Mars'}, {}, {}]
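
If the empty dicts are unwanted, one alternative is to build a new list and leave the input untouched. The following is only a minimal sketch of that idea, not part of the function above: `unique_copy` is a hypothetical name, it handles a single key, and it assumes that dicts lacking the key should always be kept.

```python
import difflib

def similar(seq1, seq2):
    return difflib.SequenceMatcher(a=seq1.lower(), b=seq2.lower()).ratio() > 0.9

def unique_copy(dicts, key):
    # Hypothetical helper: keep a dict only if its value for `key` is not
    # similar to the value of any dict that has already been kept.
    kept = []
    for d in dicts:
        if key in d and any(key in k and similar(d[key], k[key]) for k in kept):
            continue  # a similar dict is already in the result
        kept.append(d)
    return kept

mylist = [{"greeting": "HELLO WORLD!"}, {"greeting": "Hello Mars"},
          {"greeting": "Hello World!!!"}, {"greeting": "hello world"}]

print(unique_copy(mylist, "greeting"))
# prints [{'greeting': 'HELLO WORLD!'}, {'greeting': 'Hello Mars'}]
```

Each candidate is compared only against the dicts already kept, so the first dict of every group of similar greetings is the one that survives, and `mylist` itself still contains all four original dicts afterwards.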

solution1
0 ACCEPTED 2012-06-14 20:53:37