Partial matching: using fuzzywuzzy to compare two columns from different DataFrames

Question

I want to compare this DataFrame, df1:

                         Product  Price
0               Waterproof Liner     40
1                   Phone Tripod     50
2               Waterproof Pants      0
3             baby Kids play Mat    985
4               Hiking BACKPACKS     34
5                security Camera    160

with df2, shown below:

                                     Product      Id
0                    Home Security IP Camera  508760
1         Hiking Backpacks – Spring Products  287950
2                   Waterproof Eyebrow Liner  678897
3          Waterproof Pants – Winter Product  987340
4  Baby Kids Water Play Mat – Summer Product  111500

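I want to compare the Product column in df1 with the Product column in df2 in order to find the correct Id for each product. If the similarity is below 80, the ID field should be set to 'Remove'.

Note: the text in the Product columns of df1 and df2 does not match 100%.

Can anyone help me with this, or show me how to use fuzzywuzzy to get the correct Id?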

Here is my code:

import pandas as pd
from fuzzywuzzy import process

data1 = {'Product1': ['Waterproof Liner','Phone Tripod','Waterproof Pants','baby Kids play Mat','Hiking BACKPACKS','security Camera'],
'Price':[40,50,0,985,34,160]}

data2 = {'Product2': ['Home Security IP Camera','Hiking Backpacks – Spring Products','Waterproof Eyebrow Liner',
        'Waterproof Pants – Winter Product','Baby Kids Water Play Mat – Summer Product'],
        'Id': [508760,287950,678897,987340,111500],}

df1 = pd.DataFrame(data1)
df2 = pd.DataFrame(data2)

dfm = pd.DataFrame(df1["Product1"].apply(lambda x: process.extractOne(x, df2["Product2"]))
                                   .tolist(), columns=['Product1',"match_comp", "Id"])

What I got:

                                    Product1  match_comp  Id
0                   Waterproof Eyebrow Liner          86   2
1                   Waterproof Eyebrow Liner          50   2
2          Waterproof Pants – Winter Product          90   3
3  Baby Kids Water Play Mat – Summer Product          86   4
4         Hiking Backpacks – Spring Products          90   1
5                    Home Security IP Camera          86   0

What I expected:

           Product  Price      ID
0    Waterproof Liner     40  678897
1        Phone Tripod     50  Remove
2    Waterproof Pants      0  987340
3  baby Kids play Mat    985  111500
4    Hiking BACKPACKS     34  287950
5     security Camera    160  508760

Answer 1

You can create a wrapper function:

def extract(s):
    name,score,_ = process.extractOne(s, df2["Product2"], score_cutoff=0)
    if score < 80:
        return 'Remove'
    return df2.set_index('Product2').loc[name, 'Id']
    

df1['ID'] = df1["Product1"].apply(extract)

Output:

             Product1  Price      ID
0    Waterproof Liner     40  678897
1        Phone Tripod     50  Remove
2    Waterproof Pants      0  987340
3  baby Kids play Mat    985  111500
4    Hiking BACKPACKS     34  287950
5     security Camera    160  508760

Note: the output is not exactly what you expected; you would have to explain why rows 4/5 should be removed.

Answer 1: accepted, 2021-07-27 11:28:52
