Fuzzy match columns and merge/join dataframes

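I am trying to merge two dataframes, each with multiple columns, based on matching values in one column from each. This code from @Erfan does a great job of fuzzy matching the target columns, but is there a way to carry over the rest of the columns? https://stackoverflow.com/a/56315491/12802642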

Dataframe

import pandas as pd

df1 = pd.DataFrame({'Key':['Apple Souce', 'Banana', 'Orange', 'Strawberry', 'John tabel']})
df2 = pd.DataFrame({'Key':['Aple suce', 'Mango', 'Orag','Jon table', 'Straw', 'Bannanna', 'Berry'],
                    'Key23':['1', '2', '3','4', '5', '6', '7']})

The matching function from @Erfan, as described in the link above

from fuzzywuzzy import process

def fuzzy_merge(df_1, df_2, key1, key2, threshold=90, limit=2):
    """
    df_1 is the left table to join
    df_2 is the right table to join
    key1 is the key column of the left table
    key2 is the key column of the right table
    threshold is how close the matches should be to return a match, based on Levenshtein distance
    limit is the number of matches that will get returned, these are sorted high to low
    """
    # Candidate strings from the right table
    s = df_2[key2].tolist()

    # For each left key, collect the top `limit` (match, score) pairs
    m = df_1[key1].apply(lambda x: process.extract(x, s, limit=limit))
    df_1['matches'] = m

    # Keep only the matches scoring at or above the threshold, joined into one string
    m2 = df_1['matches'].apply(lambda x: ', '.join([i[0] for i in x if i[1] >= threshold]))
    df_1['matches'] = m2

    return df_1

Calling the function

df = fuzzy_merge(df1, df2, 'Key', 'Key', threshold=80, limit=1)
df.sort_values(by='Key',ascending=True).reset_index()

Result

index   Key            matches
0       Apple Souce    Aple suce
1       Banana         Bannanna
2       John tabel  
3       Orange  
4       Strawberry     Straw

Desired result

index   Key            matches       Key23
0       Apple Souce    Aple suce     1
1       Banana         Bannanna      6
2       John tabel                   
3       Orange                       
4       Strawberry     Straw         5

For anyone who needs this, here is the solution I came up with.
merge = pd.merge(df, df2, left_on=['matches'],right_on=['Key'],how='outer').fillna(0)
From there you can drop the unnecessary or duplicated columns and get a clean result, like this:
clean = merge.drop(['matches', 'Key_y'], axis=1)
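
As a variation on the merge above, here is a minimal sketch that uses a left join instead of an outer join, so only the rows of the left table are kept and the matches column stays in the output, as in the desired result. It assumes fuzzy_merge was called with limit=1, so each matches cell holds at most one string. The matches values are hard-coded from the result table in the question so the snippet runs without fuzzywuzzy; the names right and result are illustrative and not part of the original answer.

```python
import pandas as pd

# Output of fuzzy_merge, hard-coded from the "Result" table in the question
# (assumption: limit=1, so each cell holds at most one match).
df = pd.DataFrame({
    'Key': ['Apple Souce', 'Banana', 'John tabel', 'Orange', 'Strawberry'],
    'matches': ['Aple suce', 'Bannanna', '', '', 'Straw'],
})
df2 = pd.DataFrame({
    'Key': ['Aple suce', 'Mango', 'Orag', 'Jon table', 'Straw', 'Bannanna', 'Berry'],
    'Key23': ['1', '2', '3', '4', '5', '6', '7'],
})

# Rename the right-hand key so it lines up with 'matches'
# and does not clash with the 'Key' column of df.
right = df2.rename(columns={'Key': 'matches'})

# how='left' keeps every row of df; rows without a match get NaN in Key23,
# which is replaced by an empty string here.
result = df.merge(right, on='matches', how='left').fillna({'Key23': ''})
print(result)
```

With limit greater than 1, matches becomes a comma-joined string and will no longer equal any single key in df2, so this kind of merge would need the matches split into separate rows first.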
