How do I merge two dataframes by partial string match?
I have two dataframes of Premier League players:
df1:
 ID  Player              Team             Pos
  1  Gabriel Dos Santos  Arsenal          DF
218  Conor Gallagher     Crystal Palace   MF
396  Gabriel Jesus       Manchester City  FW
df2:
ID  name                          team     minutes
15  Gabriel dos Santos Magalhães  Arsenal  3063
18  Gabriel Martinelli Silva      Arsenal  1855
27  Gabriel Fernando de Jesus     Arsenal  1871
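I want to merge the dataframes on name/Player and keep all rows and columns from both df1 and df2, even when a name does not appear in both dataframes. The result would look like this: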
ID  name                          team     minutes  ID   Pos  Team
15  Gabriel dos Santos Magalhães  Arsenal  3063     1    DF   Arsenal
18  Gabriel Martinelli Silva      Arsenal  1855     NA   NA   NA
27  Gabriel Fernando de Jesus     Arsenal  1871     396  FW   Manchester City
NA  Conor Gallagher               NA       NA       218  MF   Crystal Palace
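The only problem is that the names in df1 do not exactly match the names in df2 (think of a df1 name as a partial name, or a substring, of a df2 name), and some names in df1 are not in df2 at all (and vice versa).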
I tried this:
d2[d2['name'].apply(lambda player: d1['Player'].str.contains(player)).any(1)]
But it doesn't work. What should I do?
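A likely reason the attempt fails: it passes each long df2 name as the pattern and searches for it inside the shorter df1 names, so the direction of the test is reversed. Even with the direction fixed, `str.contains` is case-sensitive by default ("Dos" vs "dos"), and "Gabriel Jesus" is not a contiguous substring of "Gabriel Fernando de Jesus". Below is a minimal sketch of a word-based check, assuming that "partial name" means every word of the short name appears somewhere in the full name. `name_tokens` and `is_partial_name` are hypothetical helper names, not part of pandas.

```python
import unicodedata


def name_tokens(name):
    """Strip accents, ignore case, and split a name into a set of words."""
    decomposed = unicodedata.normalize("NFKD", name)
    # Drop combining marks so that "Magalhães" compares equal to "Magalhaes"
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return set(stripped.casefold().split())


def is_partial_name(short_name, full_name):
    """True if every word of short_name also appears in full_name (assumed rule)."""
    return name_tokens(short_name) <= name_tokens(full_name)


# A plain substring test misses names whose words are not adjacent
print("Gabriel Jesus" in "Gabriel Fernando de Jesus")                         # False
print(is_partial_name("Gabriel Jesus", "Gabriel Fernando de Jesus"))          # True
print(is_partial_name("Gabriel Dos Santos", "Gabriel dos Santos Magalhães"))  # True
print(is_partial_name("Conor Gallagher", "Gabriel Martinelli Silva"))         # False
```

This rule is stricter than fuzzy scoring: it tolerates missing middle names, case, and accents, but not misspellings.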
You can use the fuzzywuzzy package for fuzzy matching.
import pandas as pd
from fuzzywuzzy import fuzz
from fuzzywuzzy import process

df1 = pd.DataFrame({'Key':['Apple', 'Banana', 'Orange', 'Strawberry']})
df2 = pd.DataFrame({'Key':['Aple', 'Mango', 'Orag', 'Straw', 'Bannanna', 'Berry']})

def fuzzy_merge(df_1, df_2, key1, key2, threshold=90, limit=2):
    """
    :param df_1: the left table to join
    :param df_2: the right table to join
    :param key1: key column of the left table
    :param key2: key column of the right table
    :param threshold: minimum similarity score (0-100, based on Levenshtein distance) for a candidate to count as a match
    :param limit: the number of matches returned per key, sorted from high to low score
    :return: df_1 with an added 'matches' column (df_1 is modified in place)
    """
    # Candidate strings from the right table
    s = df_2[key2].tolist()
    # For each left key, the top `limit` (candidate, score) pairs
    m = df_1[key1].apply(lambda x: process.extract(x, s, limit=limit))
    df_1['matches'] = m
    # Keep only candidates scoring at or above the threshold, joined into one string
    m2 = df_1['matches'].apply(lambda x: ', '.join([i[0] for i in x if i[1] >= threshold]))
    df_1['matches'] = m2
    return df_1

fuzzy_merge(df1, df2, 'Key', 'Key', threshold=80)
Result:
          Key       matches
0       Apple          Aple
1      Banana      Bannanna
2      Orange          Orag
3  Strawberry  Straw, Berry
For more information, see this SO post.
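To get from matched names to the merged table asked for in the question, one option is to map each df1 player to a df2 name and then do an outer merge on that key. The sketch below is only one way to do this. It assumes a word-subset rule (every word of the short name appears in the full name) rather than fuzzywuzzy scoring, so it runs with pandas and the standard library alone. The helpers `name_tokens` and `find_full_name` are hypothetical names, and taking the first matching candidate is a simplification: an ambiguous short name such as "Gabriel" would need a tie-breaking rule.

```python
import unicodedata

import pandas as pd

df1 = pd.DataFrame({
    "ID": [1, 218, 396],
    "Player": ["Gabriel Dos Santos", "Conor Gallagher", "Gabriel Jesus"],
    "Team": ["Arsenal", "Crystal Palace", "Manchester City"],
    "Pos": ["DF", "MF", "FW"],
})
df2 = pd.DataFrame({
    "ID": [15, 18, 27],
    "name": ["Gabriel dos Santos Magalhães", "Gabriel Martinelli Silva",
             "Gabriel Fernando de Jesus"],
    "team": ["Arsenal", "Arsenal", "Arsenal"],
    "minutes": [3063, 1855, 1871],
})


def name_tokens(name):
    """Strip accents, ignore case, and split a name into a set of words."""
    decomposed = unicodedata.normalize("NFKD", name)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return set(stripped.casefold().split())


def find_full_name(player, candidates):
    """Return the first candidate containing every word of `player`.

    Falls back to `player` itself so unmatched rows keep a readable key.
    """
    wanted = name_tokens(player)
    for candidate in candidates:
        if wanted <= name_tokens(candidate):
            return candidate
    return player


# Add a `name` key to df1 that lines up with df2['name'] wherever a match exists
candidates = df2["name"].tolist()
df1_keyed = df1.assign(name=[find_full_name(p, candidates) for p in df1["Player"]])

# An outer merge keeps unmatched rows from both sides; the clashing ID columns get suffixes
merged = df2.merge(df1_keyed, on="name", how="outer", suffixes=("_df2", "_df1"))
print(merged.drop(columns="Player"))
```

Rows without a partner on the other side are filled with NaN, which is why the ID and minutes columns come back as floats. Row order follows the merge, not the order shown in the question.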