Apply fuzzy matching across a dataframe column and save results in a new column

I have two data frames, each with a different number of rows. Below are a couple of rows from each data set:

df1 =
     Company                                   City         State  ZIP
     FREDDIE LEES AMERICAN GOURMET SAUCE       St. Louis    MO     63101
     CITYARCHRIVER 2015 FOUNDATION             St. Louis    MO     63102
     GLAXOSMITHKLINE CONSUMER HEALTHCARE       St. Louis    MO     63102
     LACKEY SHEET METAL                        St. Louis    MO     63102

and

df2 = 
     FDA Company                    FDA City    FDA State   FDA ZIP
     LACKEY SHEET METAL             St. Louis   MO          63102
     PRIMUS STERILIZER COMPANY LLC  Great Bend  KS          67530
     HELGET GAS PRODUCTS INC        Omaha       NE          68127
     ORTHOQUEST LLC                 La Vista    NE          68128

I joined them side by side using combined_data = pandas.concat([df1, df2], axis = 1). My next goal is to compare each string under df1['Company'] to each string under df2['FDA Company'] using several different matching commands from the fuzzywuzzy module and return the value of the best match and its name. I want to store that in a new column. For example, if I ran fuzz.ratio and fuzz.token_sort_ratio on LACKEY SHEET METAL in df1['Company'] against df2['FDA Company'], it would return that the best match was LACKEY SHEET METAL with a score of 100, and this would then be saved under a new column in combined_data. The result would look like

combined_data =
     Company                                   City         State  ZIP      FDA Company                     FDA City    FDA State   FDA ZIP     fuzzy.token_sort_ratio    match    fuzzy.ratio         match
     FREDDIE LEES AMERICAN GOURMET SAUCE       St. Louis    MO     63101    LACKEY SHEET METAL              St. Louis   MO          63102       LACKEY SHEET METAL        100      LACKEY SHEET METAL  100
     CITYARCHRIVER 2015 FOUNDATION             St. Louis    MO     63102    PRIMUS STERILIZER COMPANY LLC   Great Bend  KS          67530
     GLAXOSMITHKLINE CONSUMER HEALTHCARE       St. Louis    MO     63102    HELGET GAS PRODUCTS INC         Omaha       NE          68127
     LACKEY SHEET METAL                        St. Louis    MO     63102    ORTHOQUEST LLC                  La Vista    NE          68128

I tried doing

combined_data['name_ratio'] = combined_data.apply(lambda x: fuzz.ratio(x['Company'], x['FDA Company']), axis = 1) 

But I got an error because the lengths of the columns are different.

I am stumped. How can I accomplish this?

I couldn't tell what you were doing. This is how I would do it.

import pandas as pd
from fuzzywuzzy import fuzz
from fuzzywuzzy import process

Create a series of tuples to compare:

compare = pd.MultiIndex.from_product([df1['Company'],
                                      df2['FDA Company']]).to_series()
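
For reference (my note, not part of the original answer): compare holds one (Company, FDA Company) tuple per pair of rows, with the tuples serving as both index and values, so with the sample frames above its first element would be

compare.iloc[0]
# ('FREDDIE LEES AMERICAN GOURMET SAUCE', 'LACKEY SHEET METAL')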

Create a special function to calculate fuzzy metrics and return a series.

def metrics(tup):
    return pd.Series([fuzz.ratio(*tup),
                      fuzz.token_sort_ratio(*tup)],
                     ['ratio', 'token'])
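
As a quick sanity check (my addition, using the sample company name from the question), an identical pair scores 100 on both metrics:

metrics(('LACKEY SHEET METAL', 'LACKEY SHEET METAL'))
# ratio    100
# token    100
# dtype: int64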

Apply metrics to the compare series:

compare.apply(metrics)
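
The result is a DataFrame with one row per (Company, FDA Company) pair and the columns 'ratio' and 'token'. Saving it avoids recomputing the scores in the steps below (the name scores is my own addition):

scores = compare.apply(metrics)   # index: (Company, FDA Company) pairs; columns: 'ratio', 'token'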

There are a bunch of ways to do this next part:

Get closest matches to each row of df1

compare.apply(metrics).unstack().idxmax().unstack(0)

Get closest matches to each row of df2

compare.apply(metrics).unstack(0).idxmax().unstack(0)
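
Finally, to come back to the original goal of new columns holding the best match and its score, one possible follow-up (my own sketch, not part of the answer above; it reuses the scores frame from earlier, uses the token_sort_ratio metric, and assumes company names are unique) is:

# Pivot the token scores into a df1-Company x df2-FDA-Company grid, then pick
# the best FDA Company and its score for every df1 company.
token = scores['token'].unstack()        # rows: Company, columns: FDA Company
df1['FDA match'] = df1['Company'].map(token.idxmax(axis=1))
df1['token_sort_ratio score'] = df1['Company'].map(token.max(axis=1))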

I've implemented the code in Python with parallel processing, which will be much faster than serial computation. Furthermore, only those computations where the fuzzy metric score exceeds a threshold are performed in parallel. Please see the link below for the code:

https://github.com/ankitcoder123/Important-Python-Codes/blob/main/Faster%20Fuzzy%20Match%20between%20two%20columns/Fuzzy_match.py
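
As context, here is a minimal sketch of the general idea: run one fuzzy lookup per df1 company in parallel and keep only scores above a threshold. This is my own illustration, not the code behind the link; process.extractOne, the token_sort_ratio scorer, and the threshold of 80 are assumptions.

import pandas as pd
from fuzzywuzzy import fuzz, process
from joblib import Parallel, delayed

def best_match(name, choices, threshold=80):
    # process.extractOne returns the highest-scoring choice and its score
    match, score = process.extractOne(name, choices, scorer=fuzz.token_sort_ratio)
    return (name, match, score) if score >= threshold else (name, None, score)

choices = df2['FDA Company'].tolist()
results = Parallel(n_jobs=-1)(delayed(best_match)(name, choices)
                              for name in df1['Company'])
matches = pd.DataFrame(results, columns=['Company', 'FDA Company', 'score'])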

Version compatibility:

pandas version :: 1.1.5,
fuzzywuzzy version :: 0.18.0,
joblib version :: 1.1.0

Fuzzywuzzy metric explanation: link text

Output from code: [screenshot]
