I have a list of dicts, each with a property whose value may duplicate, or closely resemble, the value in other dicts in the list. I'd like to use a similarity comparison function to uniquify this list: if the values for the key "greeting" in any two dicts are similar to within a certain percentage, only one of those dicts should be kept.
For example, in this list I want only one of the 'hello world' dicts to remain:
list = [{"greeting":"HELLO WORLD!", ...}, {"greeting":"Hello Mars", ...}, {"greeting":"Hello World!!!", ...}, {"greeting":"hello world", ...}]
After uniquifying, the result would be:
list = [{"greeting":"HELLO WORLD!", ...}, {"greeting":"Hello Mars", ...}]
All other dicts with similar greetings should be removed from the list. It doesn't matter which of the similar dicts is kept.
Here is a function by Nadia Alramli:
import difflib

def similar(seq1, seq2):
    return difflib.SequenceMatcher(a=seq1.lower(), b=seq2.lower()).ratio() > 0.9
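To see how the 0.9 threshold treats the example greetings, here is a small sketch that prints the raw ratio instead of the boolean. The helper name `similarity` is made up for this illustration; it only exposes the value that `similar` compares against 0.9.

```python
import difflib

def similarity(seq1, seq2):
    # Case-insensitive ratio in [0, 1]; 1.0 means identical after lowercasing.
    return difflib.SequenceMatcher(a=seq1.lower(), b=seq2.lower()).ratio()

greetings = ["HELLO WORLD!", "Hello Mars", "Hello World!!!", "hello world"]

# Compare the first greeting against each of the others.
for other in greetings[1:]:
    score = similarity(greetings[0], other)
    print(repr(other), round(score, 3), score > 0.9)
```

One thing to keep in mind: a threshold test like this is not transitive. "Hello World!!!" and "hello world" are each within 0.9 of "HELLO WORLD!", but they score below 0.9 against each other, so which dict is compared first can affect which ones count as duplicates.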
Using your function that determines uniqueness, you can do this:
import difflib

def similar(seq1, seq2):
    return difflib.SequenceMatcher(a=seq1.lower(), b=seq2.lower()).ratio() > 0.9

def unique(mylist, keys):
    temp = mylist[:]
    for d in mylist:
        temp.pop(0)  # temp now holds only the dicts that come after d
        for i in keys:
            if i not in d:
                continue
            for d2 in temp:
                # Pop the key from any later dict whose value is similar to d's
                if i in d2 and similar(d[i], d2[i]):
                    d2.pop(i)
    return mylist
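Note that this modifies your dictionaries in place. Rather than removing the similar dicts from the list, it pops the matching key from them, which can leave empty dicts behind: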
mylist = [{"greeting":"HELLO WORLD!"}, {"greeting":"Hello Mars"}, {"greeting":"Hello World!!!"}, {"greeting":"hello world"}]
unique(mylist, ['greeting'])
print(mylist)
Output:
[{'greeting': 'HELLO WORLD!'}, {'greeting': 'Hello Mars'}, {}, {}]
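If you would rather drop the similar dicts entirely, as the question asks, and leave the originals untouched, one possible variation is to build a new list and keep a dict only when its value is not similar to anything already kept. This is a sketch, not the answer's method: the name `unique_by_key` and the `threshold` parameter are made up for this example, and it assumes that dicts lacking the key should always be kept.

```python
import difflib

def similar(seq1, seq2, threshold=0.9):
    # Same comparison as above, with the cutoff exposed as a parameter.
    return difflib.SequenceMatcher(a=seq1.lower(), b=seq2.lower()).ratio() > threshold

def unique_by_key(items, key, threshold=0.9):
    kept = []
    for item in items:
        if key in item:
            # Skip this dict if an already-kept dict has a similar value.
            if any(key in k and similar(item[key], k[key], threshold) for k in kept):
                continue
        # Assumption: dicts without the key are never treated as duplicates.
        kept.append(item)
    return kept

mylist = [{"greeting": "HELLO WORLD!"}, {"greeting": "Hello Mars"},
          {"greeting": "Hello World!!!"}, {"greeting": "hello world"}]
result = unique_by_key(mylist, "greeting")
print(result)
```

The first dict of each similar group is the one that survives, and `mylist` itself is not modified. Each dict is compared against every kept dict, so the cost grows quadratically with the number of distinct greetings.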