Merge 2 dataframes based on partial string-match between columns
I have two dataframes, df1 and df2, as shown below:
df1:
                movie  correct_id
0             birdman         N/A
1   avengers: endgame         N/A
2            deadpool         N/A
3  once upon deadpool         N/A
df2 (the reference dataframe):
                   movie  correct_id
0               birdmans           4
1  The avengers: endgame           2
2               The King           3
3   once upon a deadpool           1
Expected result:
                movie  correct_id
0             birdman           4
1   avengers: endgame           2
2            deadpool         N/A
3  once upon deadpool           1
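How can I merge the two dataframes based on a partial string match?
Note: the movie names are not exactly the same.
From a previous post.
Input data: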
>>> df1
                movie  correct_id
0             birdman         NaN
1   avengers: endgame         NaN
2            deadpool         NaN
3  once upon deadpool         NaN
>>> df2
                   movie  correct_id
0               birdmans           4
1  The avengers: endgame           2
2               The King           3
3   once upon a deadpool           1
A bit of fuzzy logic:
import pandas as pd
from fuzzywuzzy import process
# For each df1 title, extractOne returns (best match, score, df2 index label)
dfm = pd.DataFrame(df1["movie"].apply(lambda x: process.extractOne(x, df2["movie"]))
                                .tolist(), columns=["movie", "ratio", "best_id"])
>>> dfm
                   movie  ratio  best_id
0               birdmans     93        0
1  The avengers: endgame     90        1
2   once upon a deadpool     90        3
3   once upon a deadpool     95        3
The index of dfm is the index of df1, whereas the best_id column holds the index of df2. Now you can update your first dataframe:
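THRESHOLD = 90  # adjust this number
# ids keeps df1's index; its values are df2 index labels for rows above the threshold
ids = dfm.loc[dfm["ratio"] > THRESHOLD, "best_id"]
# Look up each df2 label's correct_id; assignment then aligns on df1's index
df1["correct_id"] = ids.map(df2["correct_id"]).astype("Int64")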
>>> df1
                movie  correct_id
0             birdman           4
1   avengers: endgame           2
2            deadpool        <NA>
3  once upon deadpool           1
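If installing fuzzywuzzy is not an option, the same "best match plus threshold" idea can be sketched with the standard library's difflib. This is a swapped-in technique, not what the answer above uses: SequenceMatcher.ratio() returns a similarity between 0 and 1, and its scores differ from fuzzywuzzy's. The helper name best_match and the cutoff of 0.85 are assumptions made for illustration only.

```python
from difflib import SequenceMatcher


def best_match(name, choices):
    """Return (best_choice, score) for name, comparing case-insensitively."""
    scored = [
        (choice, SequenceMatcher(None, name.lower(), choice.lower()).ratio())
        for choice in choices
    ]
    return max(scored, key=lambda pair: pair[1])


# Reference titles and ids, copied from df2 above.
reference = {
    "birdmans": 4,
    "The avengers: endgame": 2,
    "The King": 3,
    "once upon a deadpool": 1,
}

CUTOFF = 0.85  # hypothetical cutoff on difflib's 0-1 scale; adjust for your data

matches = {}
for title in ["birdman", "avengers: endgame", "deadpool", "once upon deadpool"]:
    choice, score = best_match(title, reference)
    # Keep the id only when the best candidate is similar enough.
    matches[title] = reference[choice] if score >= CUTOFF else None

print(matches)
```

On the sample titles this assigns ids 4, 2 and 1 and leaves deadpool unmatched (None), which agrees with the expected result in the question. A plain character-level ratio penalises the extra words in "once upon a deadpool" more heavily than fuzzywuzzy's default scorer does, so the cutoff would need tuning separately on real data.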