
How to fuzzy match two lists in Python

I have two lists: ref_list and inp_list. How can I use FuzzyWuzzy to match each name in the input list against the reference list?

import pandas as pd

inp_list = pd.DataFrame(['ADAMS SEBASTIAN', 'HAIMBILI SEUN', 'MUTESI JOHN',
                         'SHEETEKELA MATT', 'MUTESI JOHN KUTALIKA',
                         'ADAMS SEBASTIAN HAUSIKU', 'PETERS WILSON',
                         'PETERS MARIO', 'SHEETEKELA  MATT NICKY'],
                        columns=['Names'])



ref_list = pd.DataFrame(['ADAMS SEBASTIAN HAUSIKU', 'HAIMBILI MIKE', 'HAIMBILI SEUN',
                         'MUTESI JOHN KUTALIKA', 'PETERS WILSON MARIO',
                         'SHEETEKELA  MATT NICKY MBILI'],
                        columns=['Names'])

After some research, I adapted some code I found on the internet. The problem is that it only works well on small samples. In my case inp_list and ref_list have 29k and 18k entries respectively, and the code takes more than a day to run.

Here is the code. First, a helper function is defined:

from fuzzywuzzy import fuzz

def match_term(term, choices, min_score=0):
    # -1 score in case there are no matches
    max_score = -1

    # return an empty string for no match
    max_name = ''

    # iterate over all candidate names
    for term2 in choices:
        # compute the fuzzy match score
        score = fuzz.token_sort_ratio(term, term2)

        # keep the candidate if it is above the threshold and beats the best so far
        if score > min_score and score > max_score:
            max_name = term2
            max_score = score
    return (max_name, max_score)


# list of dicts for easy DataFrame creation
dict_list = []

# iterate over the input names (the 'Names' column, not the DataFrame itself,
# which would only yield the column labels)
for name in inp_list['Names']:
    # use the helper above to find the best match, with a threshold of 94
    match = match_term(name, ref_list['Names'], 94)

    # store the result for this name
    dict_list.append({'passenger_name': name,
                      'match_name': match[0],
                      'score': match[1]})

How can this code be improved to run faster, and perhaps avoid re-evaluating items that have already been assessed?

You can try to vectorize the operations instead of evaluating the scores in a loop.

Build a DataFrame df whose first column, ref, holds ref_list (the full list of candidates in every row) and whose second column, inp, holds one name from inp_list per row. Then call df.apply(lambda row: process.extractOne(row['inp'], row['ref']), axis=1). For each name in inp_list, this returns the best-matching name in ref_list together with its score.
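The question also asks how to avoid evaluating items that have already been assessed. Here is a minimal sketch of one way to do that, under some assumptions: duplicate names (and names that differ only in word order or spacing) are scored once and the result is reused. To keep the sketch dependency-free, difflib.SequenceMatcher stands in for fuzz.token_sort_ratio, so its scores may differ from FuzzyWuzzy's. The helper names token_sort, best_match and match_all are made up for this illustration.

```python
from difflib import SequenceMatcher


def token_sort(name):
    """Upper-case the name, split on whitespace and sort the tokens."""
    return " ".join(sorted(name.upper().split()))


def best_match(key, choices, min_score=0):
    """Return (best_name, score), or ('', -1) if nothing beats min_score.

    `key` is an already-normalised input name; `choices` is a list of
    (original_name, normalised_name) pairs.
    """
    best_name, best_score = "", -1
    for original, choice_key in choices:
        # stand-in scorer (assumption): difflib ratio scaled to 0..100
        score = round(100 * SequenceMatcher(None, key, choice_key).ratio())
        if score > min_score and score > best_score:
            best_name, best_score = original, score
    return best_name, best_score


def match_all(inp_names, ref_names, min_score=0):
    """Match every input name, scoring each distinct normalised name once."""
    # Normalise each distinct reference name a single time, not once per pair.
    choices = [(ref, token_sort(ref)) for ref in dict.fromkeys(ref_names)]
    cache = {}  # normalised input name -> (best_name, score)
    results = []
    for name in inp_names:
        key = token_sort(name)
        if key not in cache:
            cache[key] = best_match(key, choices, min_score)
        results.append((name,) + cache[key])
    return results


inp_names = ["ADAMS SEBASTIAN", "HAIMBILI SEUN", "HAIMBILI SEUN", "SEUN HAIMBILI"]
ref_names = ["ADAMS SEBASTIAN HAUSIKU", "HAIMBILI MIKE", "HAIMBILI SEUN"]
results = match_all(inp_names, ref_names, min_score=94)
for row in results:
    print(row)
```

This only removes repeated work. The number of comparisons between distinct names is still the product of the two list lengths, so on its own it is unlikely to be enough for 29k against 18k names unless the lists contain many duplicates.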

The scorers you are using are computationally expensive for that many pairs of strings. As an alternative to fuzzywuzzy, you could try a library called string-grouper, which uses a faster TF-IDF method together with cosine similarity to find similar strings. For example:

import random, string, time
import pandas as pd
from string_grouper import match_strings

# build two lists of 5,000 random six-letter lowercase strings
random_strings_1 = ["".join(random.choices(string.ascii_lowercase, k=6))
                    for _ in range(5000)]
random_strings_2 = ["".join(random.choices(string.ascii_lowercase, k=6))
                    for _ in range(5000)]
                
series_1 = pd.Series(random_strings_1)
series_2 = pd.Series(random_strings_2)

t_1 = time.time()
matches = match_strings(series_1, series_2,
                        min_similarity=0.6)
t_2 = time.time()
print(t_2 - t_1)
print(matches)

This takes less than one second to cover 25,000,000 candidate pairs (5,000 × 5,000). For a more realistic test of the library, see https://bergvca.github.io/2017/10/14/super-fast-string-matching.html where it is claimed that

"Using this approach made it possible to search for near duplicates in a set of 663,000 company names in 42 minutes using only a dual-core laptop".

To tune the matching further, look at the keyword arguments (**kwargs) that the match_strings function above accepts.
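To illustrate the idea behind the TF-IDF and cosine-similarity approach mentioned above, here is a small pure-Python sketch. It is not string_grouper's implementation and makes several assumptions of its own: character trigrams as features, a smoothed IDF of the form ln((1 + N) / (1 + df)) + 1, and made-up helper names (ngrams, tfidf_vectors, cosine). A real library would do the same thing with sparse matrices, which is where the speed comes from.

```python
import math
from collections import Counter


def ngrams(text, n=3):
    """Character n-grams of an upper-cased, whitespace-normalised string."""
    text = " ".join(text.upper().split())
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def tfidf_vectors(strings, n=3):
    """Return one L2-normalised sparse TF-IDF vector (a dict) per string."""
    counts = [Counter(ngrams(s, n)) for s in strings]
    # document frequency: in how many strings each n-gram occurs
    doc_freq = Counter(gram for c in counts for gram in c)
    total = len(strings)
    vectors = []
    for c in counts:
        vec = {gram: tf * (math.log((1 + total) / (1 + doc_freq[gram])) + 1)
               for gram, tf in c.items()}
        norm = math.sqrt(sum(w * w for w in vec.values()))
        vectors.append({g: w / norm for g, w in vec.items()} if norm else {})
    return vectors


def cosine(u, v):
    """Cosine similarity of two L2-normalised sparse vectors."""
    if len(u) > len(v):
        u, v = v, u  # iterate over the shorter vector
    return sum(w * v.get(gram, 0.0) for gram, w in u.items())


names = ["MUTESI JOHN", "MUTESI JOHN KUTALIKA", "PETERS WILSON MARIO"]
vecs = tfidf_vectors(names)
print(cosine(vecs[0], vecs[1]))  # shares many trigrams
print(cosine(vecs[0], vecs[2]))  # shares no trigrams
```

Because each name becomes a sparse vector, strings with no n-gram in common never need to be compared at all, which is why this family of methods scales much better than scoring every pair with an edit-distance style ratio.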
