How do I use FuzzyWuzzy in Python to match names between two data frames?

Question

I have two data frames, df1 and df2. I want to use fuzzywuzzy to string-match column A in df1 against column A in df2, and return the ID in column B of df2 when the match ratio meets a given threshold.

For example:

df1 looks like this:

Name
Sally sells Seashells

df2 looks like this:

Name | ID
Sally slls sshells | 28904

What I'm trying to do is compare every value in column A of df1 against column A of df2 to find a match, and return the corresponding ID from column B of df2.

I would like to be able to set the threshold for the fuzzy ratio. For example, I only want an ID returned if the ratio is above 50.

My current code:

import pandas as pd
import numpy as np
from fuzzywuzzy import fuzz
from fuzzywuzzy import process
df1=pd.read_csv('C:\\Users\\nkurdob\\Desktop\\Sheet1.csv')
df2=pd.read_csv('C:\\Users\\nkurdob\\Desktop\\Sheet2.csv')


for i in range(len(df1)):
    em = df1['A'][i]
    test = fuzz.partial_ratio(em, df2['A']) 
    if test > 50:
        print df1['A'][i]==df2['B']

Answer 1

Firstly, thanks for the question; I had never used fuzzywuzzy before.

This is my take on your question.

Here I match the Name column of two data frames and only keep results with a score greater than 50.

Because I would then concat these results (or replace a column), I add a blank value where there is no match. You may or may not want to do this.

import pandas as pd
from fuzzywuzzy import fuzz

# Sample names keyed by row index (all keys are ints so the index has one type).
d1={1:'Tim',2:'Ted',3:'Sally',4:'Dick',5:'Ethel'}
d2={1:'Tam',2:'Tid',3:'Sally',4:'Dicky',5:'Aardvark'}

df1=pd.DataFrame.from_dict(d1,orient='index')
df2=pd.DataFrame.from_dict(d2,orient='index')

df1.columns=['Name']
df2.columns=['Name']

def match(Col1,Col2):
    """Return, for each value in Col1, the best match in Col2 scoring above 50."""
    overall=[]
    for n in Col1:
        # Score n against every candidate once, then keep scores above 50.
        scores=[(fuzz.partial_ratio(n, n2),n2) for n2 in Col2]
        result=[pair for pair in scores if pair[0]>50]
        if len(result):
            # Tuples sort by score, then by name, so the best match is last.
            result.sort()
            print('result {}'.format(result))
            print("Best M={}".format(result[-1][1]))
            overall.append(result[-1][1])
        else:
            # No candidate passed the threshold: keep a blank placeholder.
            overall.append(" ")
    return overall

print(match(df1.Name,df2.Name))

When this is run you should see output like this:

result [(67, 'Tam'), (67, 'Tid')]
Best M=Tid
result [(67, 'Tid')]
Best M=Tid
result [(100, 'Sally')]
Best M=Sally
result [(100, 'Dicky')]
Best M=Dicky
['Tid', 'Tid', 'Sally', 'Dicky', ' ']

I print the intermediate results only to demonstrate that the score-filtering clause is working.

I then sort the list of tuples (they are stored in score-then-value order) and take the last one (you could instead reverse the sort and take the first; that is up to you), then take the second element ([1]) of that tuple, which is the matched name.
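As a small illustration of that selection step, here is a sketch using the two tuples printed for 'Tim' in the output above. It needs no extra packages. The variable names are made up for the example. It also shows why 'Tid' wins the 67/67 tie: tuples compare by score first and by name second.

```python
# (score, name) tuples, as printed for 'Tim' in the example run above.
result = [(67, 'Tam'), (67, 'Tid')]

# Sorting and taking the last element, as the answer does.
best_by_sort = sorted(result)[-1]

# max() gives the same tuple without sorting the whole list.
best_by_max = max(result)

# Comparing on score alone keeps the first of the tied candidates instead.
best_score_only = max(result, key=lambda pair: pair[0])

print(best_by_sort[1])     # Tid
print(best_by_max[1])      # Tid
print(best_score_only[1])  # Tam
```

If ties should be broken some other way (for example by preferring the shorter name), the key function is the place to express that.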

This should work for any two string columns, but I have not tested this.
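To get back to the original question (returning the ID from df2 rather than the matched name), the same loop can carry the ID along with the score. The following is a minimal sketch, not tested against the asker's CSV files. `lookup_ids` and `difflib_ratio` are hypothetical names, the ID 11111 is made up, and the standard-library `difflib` scorer is only a stand-in so the snippet runs without extra packages. Passing `fuzz.partial_ratio` or `fuzz.ratio` as `scorer` should give fuzzywuzzy scoring instead, though the exact scores may differ.

```python
from difflib import SequenceMatcher


def difflib_ratio(a, b):
    """Stand-in scorer (assumption): similarity on a 0-100 scale via difflib."""
    return int(round(100 * SequenceMatcher(None, a, b).ratio()))


def lookup_ids(names, candidates, ids, scorer=difflib_ratio, threshold=50):
    """For each name, return the ID of the best-scoring candidate,
    or None when no candidate scores above the threshold."""
    pairs = list(zip(candidates, ids))
    matched = []
    for name in names:
        best_score, best_id = threshold, None
        for candidate, candidate_id in pairs:
            score = scorer(name, candidate)
            # Only accept scores strictly above the threshold / current best.
            if score > best_score:
                best_score, best_id = score, candidate_id
        matched.append(best_id)
    return matched


# Sample row from the question, plus one name that has no counterpart.
names = ["Sally sells Seashells", "Ethel"]
candidates = ["Sally slls sshells", "Aardvark"]
ids = [28904, 11111]  # 11111 is a made-up ID for illustration

print(lookup_ids(names, candidates, ids))  # [28904, None]
```

Because the function only iterates over its arguments, pandas columns can be passed directly, for example `df1['ID'] = lookup_ids(df1['Name'], df2['Name'], df2['ID'], scorer=fuzz.partial_ratio)`, assuming the columns are named as in the question's example tables.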

Answer 1 (2017-09-03 05:50:48)
