I have more than 500k records of city names, but the data is not typed consistently. For example, the word AHMADNAGAR
is typed in the following ways:
1. AEHMADNAGAR
2. AHEMADNAGR
3. AHMAD NAGAR
4. AHMADNAGGAR
This is an example for only one city. I have to scan more than 500k records and want to find words that are similar but not identical.
I created a .txt file containing 17K cities and am sharing the link; please see the file Here
What I tried:
from difflib import get_close_matches

def closeMatches(patterns, word):
    # patterns: list of candidate names; word: the name to look up
    print(get_close_matches(word, patterns))

citylist = ['AHMADNAGAR', 'XYZ', 'AEHMADNAGAR', 'AHEMADNAGR', 'AHMADNAGAR', 'AHMADNAGGAR', 'ABC', 'test', 'test2']
for city in citylist:
    closeMatches(citylist, city)
Expected output: since we pass the city at runtime, it should print values that are similar but not identical. (I have already removed exact duplicates, so we don't have any.)
Example output for one city (we have 500K cities; please check the file, I included some of them there):
AHMADNAGAR
AEHMADNAGAR
AHEMADNAGR
AHMADNAGAR
AHMADNAGGAR
The problem is that I cannot pass every city here to create the pattern manually, and it also does not show all variations.
I learned from a friend that we can use regex, but how? Is there any way to create a regex at runtime and match it against all records?
I just want to get the list of similar cities.
The second argument to get_close_matches() is a list. If you're just trying to get the close matches, you could run:
from difflib import get_close_matches

city_list = ['AHMADNAGAR', 'AEHMADNAGAR', 'AHEMADNAGR', 'AHMAD NAGAR', 'AHMADNAGGAR', 'test', 'test2']
close_matches = get_close_matches('AHMADNAGAR', city_list)
for close_match in close_matches:
    print(close_match)
There is no need for you to implement your own loop or wrap get_close_matches in another function. Just provide the name of the city that you want to match against ('AHMADNAGAR') and the list of possible matches to the get_close_matches function.
The n argument of get_close_matches (the maximum number of matches returned) defaults to 3, so specify a higher n if you want more.
>>> from difflib import get_close_matches
>>> citylist=['AHMADNAGAR','XYZ','AEHMADNAGAR','AHEMADNAGR','AHMADNAGAR','AHMADNAGGAR','ABC','test','test2']
>>> get_close_matches('AHMADNAGAR', citylist, n=len(citylist))
['AHMADNAGAR', 'AHMADNAGAR', 'AHMADNAGGAR', 'AEHMADNAGAR', 'AHEMADNAGR']
Note that the result is sorted by similarity, so the exact match is first, followed by the closest match, etc.
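To see why the results come back in that order, it can help to print the similarity score for each candidate. The sketch below assumes the scoring is done with difflib.SequenceMatcher.ratio(), which is what the difflib documentation describes for get_close_matches; the helper name similarity and the variable names are made up for illustration.

```python
from difflib import SequenceMatcher

city_list = ['AHMADNAGAR', 'AEHMADNAGAR', 'AHEMADNAGR', 'AHMAD NAGAR', 'AHMADNAGGAR', 'test']
target = 'AHMADNAGAR'

def similarity(candidate, word):
    # Hypothetical helper: ratio() returns a float in [0, 1], where 1.0 means identical.
    return SequenceMatcher(None, candidate, word).ratio()

# Score every candidate against the target name.
scores = {name: similarity(name, target) for name in city_list}

# Print the candidates from most to least similar.
for name, score in sorted(scores.items(), key=lambda item: item[1], reverse=True):
    print(f"{score:.3f}  {name}")
```

Candidates scoring below the cutoff argument of get_close_matches (0.6 by default) are dropped from its result, so raising or lowering cutoff controls how loose a match is accepted.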
Documentation for difflib is here: https://docs.python.org/3/library/difflib.html
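To go from one hand-picked name to the whole list, one possible approach is to group the names greedily: take a name, collect its close matches from the names not yet grouped, and repeat. This is only a sketch and not something either answer proposes; the function name group_similar and the cutoff of 0.8 are assumptions chosen for illustration, and the right cutoff depends on the data.

```python
from difflib import get_close_matches

def group_similar(names, cutoff=0.8):
    # Hypothetical helper: greedily group names similar to each group's first member.
    remaining = list(dict.fromkeys(names))  # drop exact duplicates, keep order
    groups = []
    while remaining:
        seed = remaining.pop(0)
        # n must be at least 1, even when nothing is left to compare against.
        matches = get_close_matches(seed, remaining,
                                    n=max(1, len(remaining)), cutoff=cutoff)
        matched = set(matches)
        remaining = [name for name in remaining if name not in matched]
        groups.append([seed] + matches)
    return groups

citylist = ['AHMADNAGAR', 'XYZ', 'AEHMADNAGAR', 'AHEMADNAGR', 'AHMAD NAGAR',
            'AHMADNAGGAR', 'ABC', 'test', 'test2']
groups = group_similar(citylist)
for group in groups:
    print(group)
```

Each call to get_close_matches compares the seed against every remaining name, so the total work grows roughly quadratically with the number of records. For 500k names this may be too slow without first narrowing the candidates, for example by comparing only names that share a first letter.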