Merge dataframes on multiple columns with fuzzy match in Python
I have two example dataframes, as follows:
import pandas as pd
df1 = pd.DataFrame({'Name': {0: 'John', 1: 'Bob', 2: 'Shiela'},
                    'Degree': {0: 'Masters', 1: 'Graduate', 2: 'Graduate'},
                    'Age': {0: 27, 1: 23, 2: 21}})
df2 = pd.DataFrame({'Name': {0: 'John S.', 1: 'Bob K.', 2: 'Frank'},
                    'Degree': {0: 'Master', 1: 'Graduated', 2: 'Graduated'},
                    'GPA': {0: 3, 1: 3.5, 2: 4}})
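I want to merge them on the two columns Name and Degree using fuzzy matching, so that likely duplicates are collapsed. This is what I came up with, with help from the reference here: Apply fuzzy matching across a dataframe column and save results in a new column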
from fuzzywuzzy import fuzz
from fuzzywuzzy import process
compare = pd.MultiIndex.from_product([df1['Name'],
                                      df2['Name']]).to_series()
def metrics(tup):
    return pd.Series([fuzz.ratio(*tup),
                      fuzz.token_sort_ratio(*tup)],
                     ['ratio', 'token'])
compare.apply(metrics)
compare.apply(metrics).unstack().idxmax().unstack(0)
compare.apply(metrics).unstack(0).idxmax().unstack(0)
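Suppose that when fuzz.ratio is above 80 for both a person's Name and Degree, we consider the two rows to be the same person, and take Name and Degree from df1 as the default values. How can I get the following expected result? Thanks.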
df = df1.merge(df2, on = ['Name', 'Degree'], how = 'outer')
     Name     Degree   Age  GPA duplicatedName duplicatedDegree
0    John    Masters  27.0  3.0        John S.           Master
1     Bob   Graduate  23.0  3.5         Bob K.        Graduated
2  Shiela   Graduate  21.0  NaN            NaN        Graduated
3   Frank  Graduated   NaN  4.0            NaN         Graduate
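I think the ratio threshold should be lower; 60 works for me. Build a Series from a dict comprehension over all pairs, filter it by N, and keep the highest-scoring pair for each value. Finally, use map with fillna on df2 and then merge.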
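To see why a threshold of 80 is too strict for the names here, the pair scores can be checked directly. This is a minimal sketch that assumes the standard library's difflib.SequenceMatcher as a stand-in for fuzz.ratio; the `ratio` helper is hypothetical and fuzzywuzzy's scores may differ slightly depending on how it is installed.

```python
from difflib import SequenceMatcher


def ratio(a, b):
    # Assumed stand-in for fuzz.ratio: difflib similarity scaled to 0-100.
    return round(100 * SequenceMatcher(None, a, b).ratio())


# The pairs from df1/df2 that should be treated as the same value.
pairs = [('John', 'John S.'), ('Bob', 'Bob K.'),
         ('Masters', 'Master'), ('Graduate', 'Graduated')]
scores = {pair: ratio(*pair) for pair in pairs}
for pair, score in scores.items():
    print(pair, score)
# ('John', 'John S.') 73
# ('Bob', 'Bob K.') 67
# ('Masters', 'Master') 92
# ('Graduate', 'Graduated') 94
```

Both name pairs score below 80, so only the degrees would match at that threshold; with N = 60 all four pairs pass.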
from fuzzywuzzy import fuzz
from fuzzywuzzy import process
from itertools import product
N = 60
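# score every (df1 name, df2 name) pair
names = {tup: fuzz.ratio(*tup) for tup in
         product(df1['Name'].tolist(), df2['Name'].tolist())}
s1 = pd.Series(names)
s1 = s1[s1 > N]
# best-scoring pair per df1 name, turned into a df2 name -> df1 name lookup for map
best = s1.groupby(level=0).idxmax()
s1 = pd.Series({right: left for left, right in best})
print (s1)
Bob K.      Bob
John S.    John
dtype: object
# same steps for the Degree column
degrees = {tup: fuzz.ratio(*tup) for tup in
           product(df1['Degree'].tolist(), df2['Degree'].tolist())}
s2 = pd.Series(degrees)
s2 = s2[s2 > N]
best = s2.groupby(level=0).idxmax()
s2 = pd.Series({right: left for left, right in best})
print (s2)
Graduated    Graduate
Master        Masters
dtype: object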
df2['Name'] = df2['Name'].map(s1).fillna(df2['Name'])
df2['Degree'] = df2['Degree'].map(s2).fillna(df2['Degree'])
#generally slower alternative
#df2['Name'] = df2['Name'].replace(s1)
#df2['Degree'] = df2['Degree'].replace(s2)
df = df1.merge(df2, on = ['Name', 'Degree'], how = 'outer')
print (df)
     Name    Degree   Age  GPA
0    John   Masters  27.0  3.0
1     Bob  Graduate  23.0  3.5
2  Shiela  Graduate  21.0  NaN
3   Frank  Graduate   NaN  4.0
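The result above does not include the duplicatedName and duplicatedDegree columns from the question. One possible way to keep them is to save df2's original spelling before it is overwritten. The following is a sketch under stated assumptions: `ratio` uses difflib as a stand-in for fuzz.ratio, and `best_matches` is a hypothetical helper, not part of pandas or fuzzywuzzy. It matches each column independently, as the answer does, so the output follows the answer's result rather than reproducing the question's table exactly.

```python
from difflib import SequenceMatcher
from itertools import product

import pandas as pd

df1 = pd.DataFrame({'Name': ['John', 'Bob', 'Shiela'],
                    'Degree': ['Masters', 'Graduate', 'Graduate'],
                    'Age': [27, 23, 21]})
df2 = pd.DataFrame({'Name': ['John S.', 'Bob K.', 'Frank'],
                    'Degree': ['Master', 'Graduated', 'Graduated'],
                    'GPA': [3, 3.5, 4]})

N = 60  # threshold used in the answer


def ratio(a, b):
    # Assumed stand-in for fuzz.ratio: difflib similarity scaled to 0-100.
    return round(100 * SequenceMatcher(None, a, b).ratio())


def best_matches(left, right, threshold):
    # Hypothetical helper: map each value of `right` to its best match in `left`.
    scores = pd.Series({pair: ratio(*pair)
                        for pair in product(left.unique(), right.unique())})
    scores = scores[scores > threshold]
    best = scores.groupby(level=0).idxmax()  # best pair per `left` value
    return {r: l for l, r in best}


out = df2.copy()
for col in ['Name', 'Degree']:
    matched = out[col].map(best_matches(df1[col], df2[col], N))
    # Keep df2's original spelling only where a fuzzy match replaced it.
    out['duplicated' + col] = out[col].where(matched.notna())
    out[col] = matched.fillna(out[col])

df = df1.merge(out, on=['Name', 'Degree'], how='outer')
print(df)
```

Row order in the merged frame can differ between pandas versions, so no printed output is shown here.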