I'm trying to filter duplicate users from a database. Each user has a unique user_id and a full name. I'm comparing the names using difflib.get_close_matches.
As the names aren't unique, I created a dictionary with the user_id as key and the name as value. But comparing names this way requires iterating over the full dictionary every time, and accessing the names is kind of a pain.
I was thinking about just using a 2D array (a list of lists), as it's quicker to get at the data, but I don't really want to work with indexes (IMHO it's a pretty ugly way to deal with the problem). Any suggestions on how to solve this in an elegant way are highly appreciated. I'm still learning Python, by the way.
Edit: The dataset looks like this:
user_id  name
4050     John Doe
4059     John doe
4052     John Doe1
9083     Napoleon Bonnaparte
7842     Mad Max
4085     Johnn Doe
4084     Alice Spring
5673     Fredy Krüger
4092     Alice Spring1
4042     Alice k Spring
4122     Max miller
In the end I need to find the user_ids for the names which are similar; that's why I am using difflib.get_close_matches.
So the list should look like the following in the end:
user_id  name
4050     John Doe
4059     John doe
4052     John Doe1
4085     Johnn Doe
4084     Alice Spring
4092     Alice Spring1
4042     Alice k Spring
It looks to me like you really want to go from name to id and not the other way around. The way to tackle the issue of full names not necessarily being unique is to have a list of user_ids against each full name. So, reverse your dictionary that has the user_id as key and the name as related object. Like this:
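from collections import defaultdict

# mydict is your existing dictionary: user_id -> full name
lookup = defaultdict(list)
for user_id, name in mydict.items():
    # several user_ids can share one full name
    lookup[name].append(user_id)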
Now build a dict of close matches using difflib.get_close_matches(): the key is a full name, and the value is a list of potentially duplicate full names. It appears from your question that you already know how to do that.
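For completeness, here is a minimal sketch of that step using the sample data from the question. The name close_matches matches the loop below; the cutoff of 0.6 is just difflib's default and is an assumption you will probably want to tune for your data.

```python
import difflib
from collections import defaultdict

# Sample data from the question: user_id -> full name
mydict = {
    4050: "John Doe", 4059: "John doe", 4052: "John Doe1",
    9083: "Napoleon Bonnaparte", 7842: "Mad Max", 4085: "Johnn Doe",
    4084: "Alice Spring", 5673: "Fredy Krüger", 4092: "Alice Spring1",
    4042: "Alice k Spring", 4122: "Max miller",
}

# Reverse the dictionary: full name -> list of user_ids
lookup = defaultdict(list)
for user_id, name in mydict.items():
    lookup[name].append(user_id)

names = list(lookup)
close_matches = {}
for name in names:
    # Compare each name against all the other distinct names
    others = [other for other in names if other != name]
    if not others:
        continue  # get_close_matches requires n > 0
    matches = difflib.get_close_matches(
        name, others, n=len(others), cutoff=0.6  # cutoff is an assumed value
    )
    if matches:
        close_matches[name] = matches
```

Names with no close match (such as "Mad Max" here) simply never get a key in close_matches. Note that the comparison is case-sensitive, so you may want to lowercase the names first if "John Doe" and "John doe" should count as identical rather than merely similar.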
Loop through your dict of close matches and report full name and id:
for name, duplicate_list in close_matches.items():
    # every user_id that carries this exact name
    for user_id in lookup[name]:
        print(user_id, name)
    # every user_id that carries a similar name
    for duplicate in duplicate_list:
        for user_id in lookup[duplicate]:
            if duplicate != name:
                print(user_id, duplicate, "possible duplicate of", name)
I've put a print() call here for simplicity, but you will almost certainly want to assemble the results into a list for further processing.
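One way to collect the results instead of printing them is sketched below. The helper name collect_duplicates and the tiny hand-written lookup and close_matches dicts are made up for illustration; in practice you would pass in the dicts you built earlier.

```python
def collect_duplicates(lookup, close_matches):
    """Return sorted (user_id, name) pairs for every name involved in a close match."""
    flagged = set()  # a set, so a pair seen from both sides is stored once
    for name, duplicate_list in close_matches.items():
        for candidate in [name, *duplicate_list]:
            for user_id in lookup.get(candidate, []):
                flagged.add((user_id, candidate))
    return sorted(flagged)


# Hypothetical miniature inputs, just to show the shape of the data
lookup = {"John Doe": [4050], "John doe": [4059], "Mad Max": [7842]}
close_matches = {"John Doe": ["John doe"], "John doe": ["John Doe"]}

duplicates = collect_duplicates(lookup, close_matches)
print(duplicates)  # [(4050, 'John Doe'), (4059, 'John doe')]
```

Using a set avoids listing the same user twice when "John Doe" matches "John doe" and vice versa.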