Merge 2 dataframes based on partial string-match between columns
I have two data frames, df1 and df2, as shown below:
Df1:
                movie  correct_id
0             birdman         N/A
1   avengers: endgame         N/A
2            deadpool         N/A
3  once upon deadpool         N/A
Df2: data frame of reference
                   movie  correct_id
0               birdmans           4
1  The avengers: endgame           2
2               The King           3
3   once upon a deadpool           1
Expected Result:
                movie  correct_id
0             birdman           4
1   avengers: endgame           2
2            deadpool         N/A
3  once upon deadpool           1
How do I merge two dataframes based on a partial string match?
NB: The movie names are not exactly the same.
From a previous post.
Input data:
>>> df1
                movie  correct_id
0             birdman         NaN
1   avengers: endgame         NaN
2            deadpool         NaN
3  once upon deadpool         NaN
>>> df2
                   movie  correct_id
0               birdmans           4
1  The avengers: endgame           2
2               The King           3
3   once upon a deadpool           1
A bit of fuzzy logic:
import pandas as pd
from fuzzywuzzy import process

# For each df1 title, extractOne returns (best df2 title, score, df2 index label)
dfm = pd.DataFrame(df1["movie"].apply(lambda x: process.extractOne(x, df2["movie"]))
                                .tolist(), columns=["movie", "ratio", "best_id"])
>>> dfm
                   movie  ratio  best_id
0               birdmans     93        0
1  The avengers: endgame     90        1
2   once upon a deadpool     90        3
3   once upon a deadpool     95        3
The index of dfm is the index of df1, while the column best_id holds the index of df2. Now you can update your first dataframe:
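THRESHOLD = 90  # adjust this number
ids = dfm.loc[dfm["ratio"] > THRESHOLD, "best_id"]
# Relabel the looked-up ids with df1's index so the assignment aligns on df1 rows
df1["correct_id"] = df2.loc[ids, "correct_id"].set_axis(ids.index).astype("Int64")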
>>> df1
                movie  correct_id
0             birdman           4
1   avengers: endgame           2
2            deadpool        <NA>
3  once upon deadpool           1
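If fuzzywuzzy is not available, the same match-then-threshold idea can be sketched with difflib from the standard library. This is only an illustration, not the method used above: the helper name best_match, the plain lists standing in for the dataframe columns, and the THRESHOLD of 80 are all assumptions, and SequenceMatcher scores differ from fuzzywuzzy's, so the threshold has to be tuned separately.

```python
from difflib import SequenceMatcher


def best_match(query, choices):
    """Return (position, score) of the choice most similar to query.

    The score is SequenceMatcher's ratio scaled to 0-100; it is NOT the same
    metric as fuzzywuzzy's default scorer.
    """
    scored = [
        (pos, round(100 * SequenceMatcher(None, query.lower(), choice.lower()).ratio()))
        for pos, choice in enumerate(choices)
    ]
    return max(scored, key=lambda pair: pair[1])


# Plain lists standing in for df1["movie"], df2["movie"] and df2["correct_id"]
df1_movies = ["birdman", "avengers: endgame", "deadpool", "once upon deadpool"]
df2_movies = ["birdmans", "The avengers: endgame", "The King", "once upon a deadpool"]
df2_ids = [4, 2, 3, 1]

THRESHOLD = 80  # assumed cut-off for this scorer; adjust for your data

correct_ids = []
for title in df1_movies:
    pos, score = best_match(title, df2_movies)
    # Keep the id only when the best candidate is similar enough
    correct_ids.append(df2_ids[pos] if score > THRESHOLD else None)

print(correct_ids)
```

Because the result list is built in the order of df1_movies, it can be assigned straight back as a column of df1 without any index alignment concerns.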