
Partial match to compare 2 columns from different dataframes using fuzzywuzzy

I want to compare this dataframe, df1:

                         Product  Price
0               Waterproof Liner     40
1                   Phone Tripod     50
2               Waterproof Pants      0
3             baby Kids play Mat    985
4               Hiking BACKPACKS     34
5                security Camera    160

with df2, shown below:

                                     Product      Id
0                    Home Security IP Camera  508760
1         Hiking Backpacks – Spring Products  287950
2                   Waterproof Eyebrow Liner  678897
3          Waterproof Pants – Winter Product  987340
4  Baby Kids Water Play Mat – Summer Product  111500

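I want to compare the Product column in df1 with the Product column in df2 in order to find the right Id for each product. If the similarity is < 80, it should put 'Remove' in the ID field. Note: the text in the Product columns of df1 and df2 does not match 100%. Can anyone help me with this, or show me how to use fuzzywuzzy to get the right Id?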

Here is my code:

import pandas as pd
from fuzzywuzzy import process

data1 = {'Product1': ['Waterproof Liner','Phone Tripod','Waterproof Pants','baby Kids play Mat','Hiking BACKPACKS','security Camera'],
'Price':[40,50,0,985,34,160]}

data2 = {'Product2': ['Home Security IP Camera','Hiking Backpacks – Spring Products','Waterproof Eyebrow Liner',
        'Waterproof Pants – Winter Product','Baby Kids Water Play Mat – Summer Product'],
        'Id': [508760,287950,678897,987340,111500],}

df1 = pd.DataFrame(data1)
df2 = pd.DataFrame(data2)

dfm = pd.DataFrame(df1["Product1"].apply(lambda x: process.extractOne(x, df2["Product2"]))
                                   .tolist(), columns=['Product1',"match_comp", "Id"])

What I got:

                                    Product1  match_comp  Id
0                   Waterproof Eyebrow Liner          86   2
1                   Waterproof Eyebrow Liner          50   2
2          Waterproof Pants – Winter Product          90   3
3  Baby Kids Water Play Mat – Summer Product          86   4
4         Hiking Backpacks – Spring Products          90   1
5                    Home Security IP Camera          86   0

What I expected:

              Product  Price      ID
0    Waterproof Liner     40  678897
1        Phone Tripod     50  Remove
2    Waterproof Pants      0  987340
3  baby Kids play Mat    985  111500
4    Hiking BACKPACKS     34  287950
5     security Camera    160  508760

You can create a wrapper function:

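def extract(s):
    # Best match in df2 for the string s: (matched text, score, index label)
    name, score, _ = process.extractOne(s, df2["Product2"], score_cutoff=0)
    # Scores below 80 are treated as "no match"
    if score < 80:
        return 'Remove'
    # Look up the Id that belongs to the matched product name
    return df2.set_index('Product2').loc[name, 'Id']


df1['ID'] = df1["Product1"].apply(extract)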

Output:

             Product1  Price      ID
0    Waterproof Liner     40  678897
1        Phone Tripod     50  Remove
2    Waterproof Pants      0  987340
3  baby Kids play Mat    985  111500
4    Hiking BACKPACKS     34  287950
5     security Camera    160  508760

Note: the output is not exactly what you expected; you will have to explain why rows 4/5 should be removed.

