Merge dataframes on multiple columns with fuzzy matching in Python

Question

I have two example dataframes as follows:

import pandas as pd

df1 = pd.DataFrame({'Name': {0: 'John', 1: 'Bob', 2: 'Shiela'},
                   'Degree': {0: 'Masters', 1: 'Graduate', 2: 'Graduate'},
                   'Age': {0: 27, 1: 23, 2: 21}})

df2 = pd.DataFrame({'Name': {0: 'John S.', 1: 'Bob K.', 2: 'Frank'}, 
                   'Degree': {0: 'Master', 1: 'Graduated', 2: 'Graduated'}, 
                   'GPA': {0: 3, 1: 3.5, 2: 4}})

I want to merge them on the two columns Name and Degree, using fuzzy matching to weed out possible duplicates. This is what I have worked out so far with help from this reference: Apply fuzzy matching across a dataframe column and save results in a new column

from fuzzywuzzy import fuzz
from fuzzywuzzy import process

compare = pd.MultiIndex.from_product([df1['Name'],
                                      df2['Name']]).to_series()

def metrics(tup):
    return pd.Series([fuzz.ratio(*tup),
                      fuzz.token_sort_ratio(*tup)],
                     ['ratio', 'token'])
compare.apply(metrics)

compare.apply(metrics).unstack().idxmax().unstack(0)

compare.apply(metrics).unstack(0).idxmax().unstack(0)

Let's say that if fuzz.ratio is higher than 80 for both Name and Degree, we consider the two rows to be the same person, and we take Name and Degree from df1 as the default. How can I get the following expected result? Thanks.

df = df1.merge(df2, on = ['Name', 'Degree'], how = 'outer')

      Name     Degree   Age  GPA    duplicatedName   duplicatedDegree 
0     John    Masters  27.0  3.0         John S.          Master
1      Bob   Graduate  23.0  3.5          Bob K.         Graduated
2   Shiela   Graduate  21.0  NaN          NaN            Graduated
3    Frank  Graduated   NaN  4.0          NaN            Graduate
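
Before settling on a threshold, it can help to look at the raw scores for the sample names. The sketch below is an illustration only: it uses difflib.SequenceMatcher from the standard library as a stand-in for fuzz.ratio (an assumption; fuzzywuzzy's scores may differ slightly, for example when python-Levenshtein is installed), and the helper name ratio is hypothetical.

```python
from difflib import SequenceMatcher
from itertools import product


def ratio(a, b):
    # Stand-in for fuzz.ratio (assumption): similarity on a 0-100 scale.
    return int(round(100 * SequenceMatcher(None, a, b).ratio()))


names1 = ['John', 'Bob', 'Shiela']      # df1['Name']
names2 = ['John S.', 'Bob K.', 'Frank']  # df2['Name']

# Score every (df1 name, df2 name) pair.
scores = {pair: ratio(*pair) for pair in product(names1, names2)}

# Pairs that would survive each candidate threshold.
above_80 = [pair for pair, score in scores.items() if score > 80]
above_60 = [pair for pair, score in scores.items() if score > 60]

print(scores)
print(above_80)
print(above_60)
```

With this stand-in, John / John S. scores 73 and Bob / Bob K. scores 67, so no name pair clears 80, while both clear 60.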

Answer 1

I think the threshold should be lower; 60 works for me. Build a Series of scores with a dict comprehension over all pairs of values, keep only the pairs scoring above N, and take the best-scoring match for each df1 value. Finally, map the matches onto df2 with fillna and merge:

import pandas as pd
from fuzzywuzzy import fuzz
from fuzzywuzzy import process
from itertools import product

N = 60
names = {tup: fuzz.ratio(*tup) for tup in 
           product(df1['Name'].tolist(), df2['Name'].tolist())}

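s1 = pd.Series(names)
s1 = s1[s1 > N]
# keep the best-scoring df2 name for each df1 name
s1 = s1[s1.groupby(level=0).idxmax()]
# turn the (df1 name, df2 name) index into a lookup: df2 name -> df1 name
s1 = pd.Series(s1.index.get_level_values(0), index=s1.index.get_level_values(1))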

print (s1)
John S.    John
Bob K.      Bob
dtype: object

degrees = {tup: fuzz.ratio(*tup) for tup in 
           product(df1['Degree'].tolist(), df2['Degree'].tolist())}

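s2 = pd.Series(degrees)
s2 = s2[s2 > N]
# keep the best-scoring df2 degree for each df1 degree
s2 = s2[s2.groupby(level=0).idxmax()]
# turn the (df1 degree, df2 degree) index into a lookup: df2 degree -> df1 degree
s2 = pd.Series(s2.index.get_level_values(0), index=s2.index.get_level_values(1))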
print (s2)
Graduated    Graduate
Master        Masters
dtype: object

# replace matched df2 values with their df1 spelling; unmatched values stay as they are
df2['Name'] = df2['Name'].map(s1).fillna(df2['Name'])
df2['Degree'] = df2['Degree'].map(s2).fillna(df2['Degree'])
# generally slower alternative
#df2['Name'] = df2['Name'].replace(s1)
#df2['Degree'] = df2['Degree'].replace(s2)

df = df1.merge(df2, on = ['Name', 'Degree'], how = 'outer')
print (df)
     Name    Degree   Age  GPA
0    John   Masters  27.0  3.0
1     Bob  Graduate  23.0  3.5
2  Shiela  Graduate  21.0  NaN
3   Frank  Graduate   NaN  4.0
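
The expected output in the question also keeps the original df2 spellings in duplicatedName and duplicatedDegree, which the merge above does not produce. One possible way to keep them is sketched here, under stated assumptions: ratio is again a difflib-based stand-in for fuzz.ratio, best_match is a hypothetical helper, and each column is matched independently, as in the code above. Rows that exist only in df1 get NaN in the duplicated columns, so the result is close to, but not identical to, the table in the question.

```python
from difflib import SequenceMatcher

import pandas as pd


def ratio(a, b):
    # Stand-in for fuzz.ratio (assumption): similarity on a 0-100 scale.
    return int(round(100 * SequenceMatcher(None, a, b).ratio()))


def best_match(value, candidates, threshold):
    # Hypothetical helper: best-scoring candidate above threshold, else None.
    best = max(candidates, key=lambda c: ratio(c, value))
    return best if ratio(best, value) > threshold else None


df1 = pd.DataFrame({'Name': ['John', 'Bob', 'Shiela'],
                    'Degree': ['Masters', 'Graduate', 'Graduate'],
                    'Age': [27, 23, 21]})
df2 = pd.DataFrame({'Name': ['John S.', 'Bob K.', 'Frank'],
                    'Degree': ['Master', 'Graduated', 'Graduated'],
                    'GPA': [3, 3.5, 4]})

N = 60
out = df2.copy()
for col in ['Name', 'Degree']:
    candidates = df1[col].unique().tolist()
    matched = out[col].map(lambda v: best_match(v, candidates, N))
    # Remember the original df2 spelling only where a match replaced it.
    out['duplicated' + col] = out[col].where(matched.notna())
    # Use the df1 spelling where matched, else keep the df2 value.
    out[col] = matched.fillna(out[col])

df = df1.merge(out, on=['Name', 'Degree'], how='outer')
print(df)
```

Row order after an outer merge can vary between pandas versions, so it is safer to look rows up by Name than by position.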

Answer 1 (2, accepted) 2019-01-05 09:00:39
