
How to find similar, but not identical, values in a list?

I have more than 500k records of city names, but the names are not typed consistently. For example, AHMADNAGAR has been entered in the following ways:

 1. AEHMADNAGAR
 2. AHEMADNAGR
 3. AHMAD NAGAR
 4. AHMADNAGGAR

This is just one city. I have to scan more than 500k records and want to find names that are similar but not identical.

I created a .txt file that includes 17K of the cities and shared a link to it; please see the file.

What I tried:

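from difflib import get_close_matches

def closeMatches(patterns, word):
    # Print the entries in `patterns` that resemble `word`
    print(get_close_matches(word, patterns))

citylist = ['AHMADNAGAR', 'XYZ', 'AEHMADNAGAR', 'AHEMADNAGR', 'AHMADNAGAR', 'AHMADNAGGAR', 'ABC', 'test', 'test2']
for city in citylist:
    # Use the whole list as the candidates (`patterns` was undefined here)
    closeMatches(citylist, city)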

Expected output: each city is passed in at runtime, and the output should list the names that are similar to it but not identical. I have already removed identical values, so the data has no duplicates.

Example output for one city (we have 500K cities; please check the file, which includes some of them):

    AHMADNAGAR
    AEHMADNAGAR
    AHEMADNAGR
    AHMADNAGAR
    AHMADNAGGAR

The problem is that I cannot pass in each city manually to create the pattern, and the output does not show all the variations.

A friend told me that regex could be used here, but how? Is there a way to build a regex at runtime and match it against all the records?

I just want to get the list of similar cities.

The second argument to get_close_matches() is a list. If you just want the close matches, you can run:

from difflib import get_close_matches
    
city_list = ['AHMADNAGAR','AEHMADNAGAR','AHEMADNAGR','AHMAD NAGAR', 'AHMADNAGGAR','test','test2']

close_matches = get_close_matches('AHMADNAGAR', city_list)
for close_match in close_matches:
    print(close_match)

There is no need to implement your own loop or wrap get_close_matches in another function. Just pass the name of the city that you want to match against ('AHMADNAGAR') and the list of possible matches to get_close_matches. The maximum number of matches returned, n, defaults to 3, so specify a higher n if you want more.

>>> from difflib import get_close_matches
>>> citylist=['AHMADNAGAR','XYZ','AEHMADNAGAR','AHEMADNAGR','AHMADNAGAR','AHMADNAGGAR','ABC','test','test2']
>>> get_close_matches('AHMADNAGAR', citylist, n=len(citylist))
['AHMADNAGAR', 'AHMADNAGAR', 'AHMADNAGGAR', 'AEHMADNAGAR', 'AHEMADNAGR']

Note that the result is sorted by similarity, so the exact match is first, followed by the closest match, etc.
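
To see the similarity values behind that ordering, a minimal sketch follows. It assumes that scoring each candidate with difflib.SequenceMatcher(...).ratio() reflects what get_close_matches uses internally. The similarity helper and the cutoff of 0.92 are illustrative choices, not part of the original answer. Candidates scoring below cutoff (default 0.6) are dropped, so raising it keeps only the nearest spellings.

```python
from difflib import SequenceMatcher, get_close_matches

def similarity(candidate, word):
    """Return difflib's similarity ratio (0.0 to 1.0) between two strings."""
    return SequenceMatcher(None, candidate, word).ratio()

city_list = ['AEHMADNAGAR', 'AHEMADNAGR', 'AHMAD NAGAR', 'AHMADNAGGAR', 'XYZ', 'test']

# Score every candidate against the reference spelling
scores = {city: similarity(city, 'AHMADNAGAR') for city in city_list}

# Print candidates from most to least similar
for city, score in sorted(scores.items(), key=lambda item: item[1], reverse=True):
    print(f'{city}: {score:.2f}')

# A stricter cutoff (assumed value) keeps only the closest spellings
strict_matches = get_close_matches('AHMADNAGAR', city_list, n=len(city_list), cutoff=0.92)
print(strict_matches)
```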

Documentation for difflib is here: https://docs.python.org/3/library/difflib.html
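
Since the question asks for names that are similar but not identical, and for every city rather than one, here is a rough sketch of one way to do that with get_close_matches. The find_variants helper and the cutoff of 0.8 are assumptions made for illustration. Each name is compared against all the others, so the cost grows quadratically and 500k records may need a faster approach.

```python
from difflib import get_close_matches

def find_variants(cities, cutoff=0.8):
    """Map each city to the other spellings in `cities` that resemble it."""
    unique = sorted(set(cities))
    variants = {}
    for city in unique:
        # Leave the city itself out so exact matches are never reported
        others = [c for c in unique if c != city]
        # n must be positive, even when there are no other names
        matches = get_close_matches(city, others, n=max(len(others), 1), cutoff=cutoff)
        if matches:
            variants[city] = matches
    return variants

cities = ['AHMADNAGAR', 'AEHMADNAGAR', 'AHEMADNAGR', 'AHMAD NAGAR',
          'AHMADNAGGAR', 'XYZ', 'ABC']
variants = find_variants(cities)
for city, similar in variants.items():
    print(city, '->', similar)
```

Names with no close neighbour, such as 'XYZ' here, do not appear in the result at all.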
