Apply fuzzy matching across a dataframe column and save results in a new column

Question

I have two data frames with each having a different number of rows. Below is a couple rows from each data set

df1 =
     Company                                   City         State  ZIP
     FREDDIE LEES AMERICAN GOURMET SAUCE       St. Louis    MO     63101
     CITYARCHRIVER 2015 FOUNDATION             St. Louis    MO     63102
     GLAXOSMITHKLINE CONSUMER HEALTHCARE       St. Louis    MO     63102
     LACKEY SHEET METAL                        St. Louis    MO     63102

and

df2 = 
     FDA Company                    FDA City    FDA State   FDA ZIP
     LACKEY SHEET METAL             St. Louis   MO          63102
     PRIMUS STERILIZER COMPANY LLC  Great Bend  KS          67530
     HELGET GAS PRODUCTS INC        Omaha       NE          68127
     ORTHOQUEST LLC                 La Vista    NE          68128

I joined them side by side using combined_data = pandas.concat([df1, df2], axis = 1) . My next goal is to compare each string under df1['Company'] to each string under in df2['FDA Company'] using several different matching commands from the fuzzy wuzzy module and return the value of the best match and its name. I want to store that in a new column. For example if I did the fuzz.ratio and fuzz.token_sort_ratio on LACKY SHEET METAL in df1['Company'] to df2['FDA Company'] it would return that the best match was LACKY SHEET METAL with a score of 100 and this would then be saved under a new column in combined data . It results would look like

combined_data =
     Company                                   City         State  ZIP      FDA Company                     FDA City    FDA State   FDA ZIP     fuzzy.token_sort_ratio    match    fuzzy.ratio         match
     FREDDIE LEES AMERICAN GOURMET SAUCE       St. Louis    MO     63101    LACKEY SHEET METAL              St. Louis   MO          63102       LACKEY SHEET METAL        100      LACKEY SHEET METAL  100
     CITYARCHRIVER 2015 FOUNDATION             St. Louis    MO     63102    PRIMUS STERILIZER COMPANY LLC   Great Bend  KS          67530
     GLAXOSMITHKLINE CONSUMER HEALTHCARE       St. Louis    MO     63102    HELGET GAS PRODUCTS INC         Omaha       NE          68127
     LACKEY SHEET METAL                        St. Louis    MO     63102    ORTHOQUEST LLC                  La Vista    NE          68128

I tried doing

combined_data['name_ratio'] = combined_data.apply(lambda x: fuzz.ratio(x['Company'], x['FDA Company']), axis = 1)

But got an error because the lengths of the columns are different.

I am stumped. How I can accomplish this?

Answer 1

I couldn't tell what you were doing. This is how I would do it.

from fuzzywuzzy import fuzz
from fuzzywuzzy import process

Create a series of tuples to compare:

compare = pd.MultiIndex.from_product([df1['Company'],
                                      df2['FDA Company']]).to_series()

Create a special function to calculate fuzzy metrics and return a series.

def metrics(tup):
    return pd.Series([fuzz.ratio(*tup),
                      fuzz.token_sort_ratio(*tup)],
                     ['ratio', 'token'])

Apply metrics to the compare series

compare.apply(metrics)

There are bunch of ways to do this next part:

Get closest matches to each row of df1

compare.apply(metrics).unstack().idxmax().unstack(0)

Get closest matches to each row of df2

compare.apply(metrics).unstack(0).idxmax().unstack(0)

Answer 2

I've implemented the code in Python with parallel processing, which will be much faster than serial computation. Furthermore, where a fuzzy metric score exceeds a threshold, only those computations are performed in parallel. Please see the link below for the code:

https://github.com/ankitcoder123/Important-Python-Codes/blob/main/Faster%20Fuzzy%20Match%20between%20two%20columns/Fuzzy_match.py

Vesrion Compatibility:

pandas version :: 1.1.5 ,
fuzzywuzzy version :: 1.1.0 ,
joblib version :: 0.18.0

Fuzzywuzzy metric explanation: link text

Output from code:

Apply fuzzy matching across a dataframe column and save results in a new column

Question

1 answers

solution1
16 ACCPTED 2016-07-25 23:03:38

solution2
0 2021-12-18 07:10:34

Apply fuzzy matching across a dataframe column and save results in a new column

Question

1 answers

solution1 16 ACCPTED 2016-07-25 23:03:38

solution2 0 2021-12-18 07:10:34

solution1
16 ACCPTED 2016-07-25 23:03:38

solution2
0 2021-12-18 07:10:34