
Merge 2 dataframes based on partial string match between columns

I have two data frames, df1 and df2, as shown below:

df1:

                  movie    correct_id
0              birdman        N/A
1     avengers: endgame        N/A
2              deadpool        N/A
3  once upon deadpool        N/A

df2 (the reference data frame):

                   movie  correct_id
0               birdmans           4
1  The avengers: endgame           2
2               The King           3
3   once upon a deadpool           1

Expected result:

                movie  correct_id
0             birdman           4
1   avengers: endgame           2
2            deadpool         N/A
3  once upon deadpool           1

How do I merge two dataframes based on a partial string match?

NB: The movie names are not exactly the same in the two data frames.

From a previous post.

Input data:

>>> df1
                movie  correct_id
0             birdman         NaN
1   avengers: endgame         NaN
2            deadpool         NaN
3  once upon deadpool         NaN

>>> df2
                   movie  correct_id
0               birdmans           4
1  The avengers: endgame           2
2               The King           3
3   once upon a deadpool           1

A bit of fuzzy logic:

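import pandas as pd
from fuzzywuzzy import process

# For each df1 title, extractOne returns (best df2 title, score, df2 index label)
dfm = pd.DataFrame(df1["movie"].apply(lambda x: process.extractOne(x, df2["movie"]))
                               .tolist(), columns=["movie", "ratio", "best_id"])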
>>> dfm
                   movie  ratio  best_id
0               birdmans     93        0
1  The avengers: endgame     90        1
2   once upon a deadpool     90        3
3   once upon a deadpool     95        3
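
To get a feel for what a (title, score, key) result looks like without installing fuzzywuzzy, here is a minimal sketch built on the standard library's difflib. It is a swapped-in technique, not the scorer fuzzywuzzy uses, so the scores will not match the ratio column. The helper best_match and the reference dict are hypothetical names used only for illustration.

```python
from difflib import SequenceMatcher


def best_match(query, choices):
    """Return (choice, score, key) for the choice most similar to query.

    choices maps a key (here, a row label) to a candidate string. The score
    is difflib's similarity ratio scaled to 0-100. This is an assumption made
    for illustration; fuzzywuzzy computes its score differently.
    """
    best = None
    for key, choice in choices.items():
        ratio = SequenceMatcher(None, query.lower(), choice.lower()).ratio()
        score = round(100 * ratio)
        if best is None or score > best[1]:
            best = (choice, score, key)
    return best


# Row label -> title, mirroring df2["movie"] from the question
reference = {
    0: "birdmans",
    1: "The avengers: endgame",
    2: "The King",
    3: "once upon a deadpool",
}

for title in ["birdman", "avengers: endgame", "deadpool", "once upon deadpool"]:
    print(title, "->", best_match(title, reference))
```

With this scorer, "deadpool" still gets a best candidate, but its score is low, which is why a threshold is needed before trusting a match.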

The index of dfm is the index of df1, whereas the column best_id holds the index of the matching row in df2. Now you can update your first dataframe:

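THRESHOLD = 90  # adjust this number

# Keep only the matches scoring above the threshold (index: df1, values: df2 labels)
ids = dfm.loc[dfm["ratio"] > THRESHOLD, "best_id"]
# Look up each df2 label in df2["correct_id"]; the result stays aligned with df1's index
df1["correct_id"] = ids.map(df2["correct_id"]).astype("Int64")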
>>> df1
                movie  correct_id
0             birdman           4
1   avengers: endgame           2
2            deadpool        <NA>
3  once upon deadpool           1
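
The update step juggles two indexes (df1's for the rows of dfm, df2's for the values in best_id), so it is worth checking on data where a matching pair does not sit at the same row position in both frames. Below is a minimal sketch under stated assumptions: the frames are reordered, hypothetical variants of the ones above, and dfm is hardcoded with made-up scores instead of being computed by a fuzzy matcher.

```python
import pandas as pd

# Hypothetical data: df2's rows are in a different order than df1's
df1 = pd.DataFrame({"movie": ["birdman", "deadpool", "once upon deadpool"]})
df2 = pd.DataFrame({
    "movie": ["once upon a deadpool", "The King", "birdmans"],
    "correct_id": [1, 3, 4],
})

# Hardcoded stand-in for the fuzzy-matching output (scores are invented):
# the index holds df1 labels, best_id holds df2 labels
dfm = pd.DataFrame({"ratio": [93, 60, 95], "best_id": [2, 0, 0]})

THRESHOLD = 90
ids = dfm.loc[dfm["ratio"] > THRESHOLD, "best_id"]

# Series.map looks each df2 label up in df2["correct_id"] but keeps the
# index of ids (df1 labels), so values land on the right df1 rows
df1["correct_id"] = ids.map(df2["correct_id"]).astype("Int64")
print(df1)
```

Rows of df1 without a match above the threshold are absent from ids, so they end up as missing values in the nullable Int64 column.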
