
How do I merge two dataframes by partial string match?

I have two dataframes of Premier League players:

df1:

ID      Player              Team             Pos
1       Gabriel Dos Santos  Arsenal          DF
218     Conor Gallagher     Crystal Palace   MF
396     Gabriel Jesus       Manchester City  FW

df2:

ID  name                            team     minutes
15  Gabriel dos Santos Magalhães    Arsenal  3063
18  Gabriel Martinelli Silva        Arsenal  1855
27  Gabriel Fernando de Jesus       Arsenal  1871

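I want to merge the dataframes on name/Player and keep all rows and columns of both df1 and df2, even when a name does not appear in both dataframes. The result would look like this: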

ID  name                            team     minutes  ID   Pos  Team
15  Gabriel dos Santos Magalhães    Arsenal  3063     1    DF   Arsenal
18  Gabriel Martinelli Silva        Arsenal  1855     NA   NA   NA
27  Gabriel Fernando de Jesus       Arsenal  1871     396  FW   Manchester City
NA  Conor Gallagher                 NA       NA       218  MF   Crystal Palace

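The only problem is that the names in df1 do not exactly match the names in df2 (think of a df1 name as a partial name, or a substring, of the df2 name), and some names in df1 are not in df2 at all (and vice versa).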

I tried this:

df2[df2['name'].apply(lambda player: df1['Player'].str.contains(player)).any(axis=1)]

But it doesn't work. What should I do?

You can use the fuzzywuzzy package for fuzzy matching.

import pandas as pd
from fuzzywuzzy import process

df1 = pd.DataFrame({'Key': ['Apple', 'Banana', 'Orange', 'Strawberry']})
df2 = pd.DataFrame({'Key': ['Aple', 'Mango', 'Orag', 'Straw', 'Bannanna', 'Berry']})

def fuzzy_merge(df_1, df_2, key1, key2, threshold=90, limit=2):
    """
    :param df_1: the left table to join
    :param df_2: the right table to join
    :param key1: key column of the left table
    :param key2: key column of the right table
    :param threshold: minimum similarity score (0-100, derived from Levenshtein distance) for a candidate to count as a match
    :param limit: the maximum number of matches returned per key, sorted from high to low score
    :return: copy of df_1 with an extra 'matches' column
    """
    candidates = df_2[key2].tolist()
    result = df_1.copy()  # work on a copy so the caller's dataframe is not modified

    # process.extract returns up to `limit` (candidate, score) tuples, best first
    found = result[key1].apply(lambda x: process.extract(x, candidates, limit=limit))

    # keep only the candidates that score at or above the threshold
    result['matches'] = found.apply(lambda pairs: ', '.join(p[0] for p in pairs if p[1] >= threshold))

    return result

fuzzy_merge(df1, df2, 'Key', 'Key', threshold=80)

Result:

          Key       matches
0       Apple          Aple
1      Banana      Bannanna
2      Orange          Orag
3  Strawberry  Straw, Berry
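
For the names in the question, a similarity score may not even be necessary: each df1 name looks like a subset of the words in the corresponding df2 name once case is ignored. Below is a minimal sketch of that idea using only the standard library. The helper names `tokens` and `match_short_to_full` are hypothetical, and the rule itself (every word of the short name must appear in the full name, first hit wins) is an assumption that holds for the sample rows but could give ambiguous matches on a complete squad list.

```python
import unicodedata

def tokens(name):
    """Return the set of lowercase, accent-stripped words in a name."""
    decomposed = unicodedata.normalize("NFKD", name)
    without_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return set(without_accents.lower().split())

def match_short_to_full(short_names, full_names):
    """Map each short name to the first full name containing all of its words, else None."""
    full_tokens = [(full, tokens(full)) for full in full_names]
    mapping = {}
    for short in short_names:
        wanted = tokens(short)
        mapping[short] = next(
            (full for full, have in full_tokens if wanted and wanted <= have),
            None,
        )
    return mapping

# Names copied from the question's df1['Player'] and df2['name'] columns
short_names = ["Gabriel Dos Santos", "Conor Gallagher", "Gabriel Jesus"]
full_names = [
    "Gabriel dos Santos Magalhães",
    "Gabriel Martinelli Silva",
    "Gabriel Fernando de Jesus",
]

mapping = match_short_to_full(short_names, full_names)
for short, full in mapping.items():
    print(short, "->", full)
```

Comparing word sets rather than calling `str.contains` sidesteps the two problems visible in the sample data: "Dos" versus "dos", and "Gabriel Jesus" not being a contiguous substring of "Gabriel Fernando de Jesus".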

For more information, see this SO post.
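
Once each df1 player has been paired with a df2 name, by a fuzzy matcher or any other means, the combined table described in the question can be built with an outer merge. The sketch below assumes the pairing is already available as a dictionary, hard-coded here as `name_map` purely for illustration; the `_df1` suffix for the clashing `ID` column is also an arbitrary choice.

```python
import pandas as pd

df1 = pd.DataFrame({
    "ID": [1, 218, 396],
    "Player": ["Gabriel Dos Santos", "Conor Gallagher", "Gabriel Jesus"],
    "Team": ["Arsenal", "Crystal Palace", "Manchester City"],
    "Pos": ["DF", "MF", "FW"],
})
df2 = pd.DataFrame({
    "ID": [15, 18, 27],
    "name": [
        "Gabriel dos Santos Magalhães",
        "Gabriel Martinelli Silva",
        "Gabriel Fernando de Jesus",
    ],
    "team": ["Arsenal", "Arsenal", "Arsenal"],
    "minutes": [3063, 1855, 1871],
})

# Assumed output of a matching step: df1 name -> df2 name
name_map = {
    "Gabriel Dos Santos": "Gabriel dos Santos Magalhães",
    "Gabriel Jesus": "Gabriel Fernando de Jesus",
}

left = df1.copy()
# Players without a match keep their own name as the join key
left["name"] = left["Player"].map(name_map).fillna(left["Player"])

# how="outer" keeps every row from both sides; unmatched cells become NaN
merged = df2.merge(
    left.drop(columns="Player"),
    on="name",
    how="outer",
    suffixes=("", "_df1"),
)
print(merged)
```

Passing `indicator=True` to `merge` adds a `_merge` column showing whether each row came from the left side, the right side, or both, which can help when checking the matches by hand.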

