Merge dataframes on multiple columns with fuzzy match in Python
I have two example dataframes as follows:
import pandas as pd

df1 = pd.DataFrame({'Name': {0: 'John', 1: 'Bob', 2: 'Shiela'},
                    'Degree': {0: 'Masters', 1: 'Graduate', 2: 'Graduate'},
                    'Age': {0: 27, 1: 23, 2: 21}})
df2 = pd.DataFrame({'Name': {0: 'John S.', 1: 'Bob K.', 2: 'Frank'},
                    'Degree': {0: 'Master', 1: 'Graduated', 2: 'Graduated'},
                    'GPA': {0: 3, 1: 3.5, 2: 4}})
I want to merge them on the two columns Name and Degree, using fuzzy matching to weed out possible duplicates. This is what I have come up with so far, with help from this reference: Apply fuzzy matching across a dataframe column and save results in a new column
from fuzzywuzzy import fuzz
from fuzzywuzzy import process
compare = pd.MultiIndex.from_product([df1['Name'],
                                      df2['Name']]).to_series()
def metrics(tup):
    return pd.Series([fuzz.ratio(*tup),
                      fuzz.token_sort_ratio(*tup)],
                     ['ratio', 'token'])
compare.apply(metrics)
compare.apply(metrics).unstack().idxmax().unstack(0)
compare.apply(metrics).unstack(0).idxmax().unstack(0)
Let's say that when fuzz.ratio is higher than 80 for both Name and Degree, we consider the two rows to be the same person, and take Name and Degree from df1 by default. How can I get the following expected result? Thanks.
df = df1.merge(df2, on = ['Name', 'Degree'], how = 'outer')
     Name     Degree   Age  GPA duplicatedName duplicatedDegree
0    John    Masters  27.0  3.0        John S.           Master
1     Bob   Graduate  23.0  3.5         Bob K.        Graduated
2  Shiela   Graduate  21.0  NaN            NaN        Graduated
3   Frank  Graduated   NaN  4.0            NaN         Graduate
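For context, a plain merge on the unmodified frames finds no common rows, because none of the Name or Degree values are exactly equal. A minimal sketch to check this, assuming the sample data from the question; the `indicator=True` argument and the `check` variable are additions that are not in the original post:

```python
import pandas as pd

df1 = pd.DataFrame({'Name': ['John', 'Bob', 'Shiela'],
                    'Degree': ['Masters', 'Graduate', 'Graduate'],
                    'Age': [27, 23, 21]})
df2 = pd.DataFrame({'Name': ['John S.', 'Bob K.', 'Frank'],
                    'Degree': ['Master', 'Graduated', 'Graduated'],
                    'GPA': [3, 3.5, 4]})

# indicator=True adds a '_merge' column saying where each row came from:
# 'left_only', 'right_only' or 'both'.
check = df1.merge(df2, on=['Name', 'Degree'], how='outer', indicator=True)
print(check[['Name', 'Degree', '_merge']])

# Number of rows that matched exactly on both key columns.
n_both = int((check['_merge'] == 'both').sum())
print(n_both)
```

Every row comes back as left_only or right_only, which is why the values have to be aligned by fuzzy matching before the merge.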
I think the ratio threshold should be lower; 60 works for me. Create a Series with a dict comprehension, filter by N, and keep the best-scoring match for each value. Finally, use map with fillna, then merge:
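from fuzzywuzzy import fuzz
from fuzzywuzzy import process
from itertools import product

N = 60
# score every (df1 name, df2 name) pair
names = {tup: fuzz.ratio(*tup) for tup in
         product(df1['Name'].tolist(), df2['Name'].tolist())}
s1 = pd.Series(names)
s1 = s1[s1 > N]
# best-scoring (df1 name, df2 name) pair for each df1 name
best = s1.groupby(level=0).idxmax()
# lookup Series: df2 name -> df1 name
s1 = pd.Series({b: a for a, b in best})
print (s1)
Bob K.      Bob
John S.    John
dtype: object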
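# score every (df1 degree, df2 degree) pair
degrees = {tup: fuzz.ratio(*tup) for tup in
           product(df1['Degree'].tolist(), df2['Degree'].tolist())}
s2 = pd.Series(degrees)
s2 = s2[s2 > N]
# best-scoring (df1 degree, df2 degree) pair for each df1 degree
best = s2.groupby(level=0).idxmax()
# lookup Series: df2 degree -> df1 degree
s2 = pd.Series({b: a for a, b in best})
print (s2)
Graduated    Graduate
Master        Masters
dtype: object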
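To see why a threshold of 80 is too strict for the names, the pairwise scores can be inspected directly. The sketch below is an illustration only: it uses the standard library's difflib.SequenceMatcher as a stand-in for fuzz.ratio (an assumption; fuzzywuzzy falls back to difflib when python-Levenshtein is not installed, and scores may differ slightly otherwise), and the helpers `ratio` and `pairs_above` are hypothetical names.

```python
from difflib import SequenceMatcher
from itertools import product


def ratio(a, b):
    """Similarity on a 0-100 scale; stand-in for fuzz.ratio (assumption)."""
    return round(100 * SequenceMatcher(None, a, b).ratio())


def pairs_above(left, right, threshold):
    """Score every (left, right) pair and keep those above the threshold."""
    return {(a, b): ratio(a, b)
            for a, b in product(left, right)
            if ratio(a, b) > threshold}


names1 = ['John', 'Bob', 'Shiela']
names2 = ['John S.', 'Bob K.', 'Frank']
degrees1 = ['Masters', 'Graduate']
degrees2 = ['Master', 'Graduated']

print(pairs_above(names1, names2, 80))
# {}
print(pairs_above(names1, names2, 60))
# {('John', 'John S.'): 73, ('Bob', 'Bob K.'): 67}
print(pairs_above(degrees1, degrees2, 80))
# {('Masters', 'Master'): 92, ('Graduate', 'Graduated'): 94}
```

With this scorer the degrees clear 80 comfortably, but the names score only 73 and 67, so no name pair survives a cutoff of 80.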
df2['Name'] = df2['Name'].map(s1).fillna(df2['Name'])
df2['Degree'] = df2['Degree'].map(s2).fillna(df2['Degree'])
#generally slower alternative
#df2['Name'] = df2['Name'].replace(s1)
#df2['Degree'] = df2['Degree'].replace(s2)
df = df1.merge(df2, on = ['Name', 'Degree'], how = 'outer')
print (df)
     Name    Degree   Age  GPA
0    John   Masters  27.0  3.0
1     Bob  Graduate  23.0  3.5
2  Shiela  Graduate  21.0  NaN
3   Frank  Graduate   NaN  4.0
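The approach above maps Name and Degree independently, so it does not check that both columns match for the same pair of rows, and it does not keep the original df2 values. A rough sketch of a row-level variant follows, under stated assumptions: difflib.SequenceMatcher stands in for fuzz.ratio, the helper `ratio` and the columns `i1`, `i2`, `name_score`, `degree_score` and `total` are hypothetical, and conflicts are resolved with a simple best-score-first rule rather than an optimal assignment.

```python
from difflib import SequenceMatcher

import pandas as pd


def ratio(a, b):
    """Similarity on a 0-100 scale; stand-in for fuzz.ratio (assumption)."""
    return round(100 * SequenceMatcher(None, a, b).ratio())


df1 = pd.DataFrame({'Name': ['John', 'Bob', 'Shiela'],
                    'Degree': ['Masters', 'Graduate', 'Graduate'],
                    'Age': [27, 23, 21]})
df2 = pd.DataFrame({'Name': ['John S.', 'Bob K.', 'Frank'],
                    'Degree': ['Master', 'Graduated', 'Graduated'],
                    'GPA': [3, 3.5, 4]})

N = 60  # threshold used in the answer above

# Keep the original row labels so matched rows can be excluded later.
left = df1.rename_axis('i1').reset_index()
right = (df2.rename(columns={'Name': 'duplicatedName',
                             'Degree': 'duplicatedDegree'})
            .rename_axis('i2').reset_index())

# Compare every df1 row with every df2 row (how='cross' needs pandas >= 1.2).
pairs = left.merge(right, how='cross')
pairs['name_score'] = [ratio(a, b) for a, b in
                       zip(pairs['Name'], pairs['duplicatedName'])]
pairs['degree_score'] = [ratio(a, b) for a, b in
                         zip(pairs['Degree'], pairs['duplicatedDegree'])]

# A pair counts as the same person only if BOTH columns clear the threshold.
matches = pairs[(pairs['name_score'] > N) & (pairs['degree_score'] > N)]
# Simplification: keep the best-scoring candidate per df1 row, then per df2 row.
matches = (matches.assign(total=matches['name_score'] + matches['degree_score'])
                  .sort_values('total', ascending=False)
                  .drop_duplicates('i1')
                  .drop_duplicates('i2'))

matched = matches[['Name', 'Degree', 'Age', 'GPA',
                   'duplicatedName', 'duplicatedDegree']]
only_df1 = df1[~df1.index.isin(matches['i1'])]   # df1 rows with no partner
only_df2 = df2[~df2.index.isin(matches['i2'])]   # df2 rows with no partner
result = pd.concat([matched, only_df1, only_df2], ignore_index=True)
print(result)
```

On the sample data this pairs John with John S. and Bob with Bob K., keeps Name and Degree from df1 for the matched rows, and leaves Shiela and Frank unmatched with NaN in the duplicated columns. The cross join grows with len(df1) * len(df2), so it is only practical for small frames.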