How to calculate matching percentage difference between two dataframes

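I am trying to find the percentage difference between two dataframes. I tried using fuzzywuzzy, but did not get the expected output.

Suppose I have 2 dataframes, each with 3 columns, and I want to find the matching percentage between these 2 dataframes.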

df1

score   id_number       company_name      company_code   
200      IN2231D           AXN pvt Ltd        IN225                 
450      UK654IN        Aviva Intl Ltd        IN115                 
650      SL1432H   Ship Incorporations        CZ555                  
350      LK0678G  Oppo Mobiles pvt ltd        PQ795                 
590      NG5678J             Nokia Inc        RS885                 
250      IN2231D           AXN pvt Ltd        IN215                 

df2

  QR_score     Identity_No       comp_name      comp_code      match_acc   
    200.00      IN2231D           AXN pvt Inc        IN225                 
    420.0       UK655IN        Aviva Intl Ltd        IN315                 
    350.35      SL2252H              Ship Inc        CK555                  
    450.0       LK9978G  Oppo Mobiles pvt ltd        PRS95                 
    590.5       NG5678J             Nokia Inc        RS885                 
    250.0       IN5531D           AXN pvt Ltd        IN215 

The code I am using:

df1 = df[['score','id_number','company_code']]
df2 = df[['QR_score','Identity_No','comp_code']]

for idx, row1 in df1.iterrows():
   for idx2, row2 in df2.iterrows():
      df2['match_acc'] =   

For example, if the first row of the dataframe matches 75%, that value should be listed in the df2['match_acc'] column, and the same rule applies to every row.

IIUC, rename the columns so they match, then use eq + mean on axis 1:

df1.columns = df2.columns
df2['match_acc'] = df1.eq(df2).mean(axis=1) * 100

df2

   QR_score Identity_No             comp_name comp_code  match_acc
0    200.00     IN2231D           AXN pvt Inc     IN225       75.0
1    420.00     UK655IN        Aviva Intl Ltd     IN315       25.0
2    350.35     SL2252H              Ship Inc     CK555        0.0
3    450.00     LK9978G  Oppo Mobiles pvt ltd     PRS95       25.0
4    590.50     NG5678J             Nokia Inc     RS885       75.0
5    250.00     IN5531D           AXN pvt Ltd     IN215       75.0
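
To see where these percentages come from, it can help to look at the intermediate boolean frame for a single row. The following is a minimal sketch using only the first row of the sample data; the names `left`, `right`, `matches`, and `percent` are illustrative and not part of the original answer.

```python
import pandas as pd

# First row of each sample frame, with the column names already aligned.
cols = ['QR_score', 'Identity_No', 'comp_name', 'comp_code']
left = pd.DataFrame([[200, 'IN2231D', 'AXN pvt Ltd', 'IN225']], columns=cols)
right = pd.DataFrame([[200.00, 'IN2231D', 'AXN pvt Inc', 'IN225']], columns=cols)

# eq compares the frames cell by cell and returns a boolean frame.
matches = left.eq(right)
print(matches)

# True counts as 1 and False as 0, so the row mean is the
# fraction of cells in that row that are exactly equal.
percent = matches.mean(axis=1) * 100
print(percent)
```

Three of the four cells in this row are equal (only comp_name differs), which is why the first row scores 75.0.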

Complete working example:

import pandas as pd

df1 = pd.DataFrame({
    'score': [200, 450, 650, 350, 590, 250],
    'id_number': ['IN2231D', 'UK654IN', 'SL1432H', 'LK0678G', 'NG5678J',
                  'IN2231D'],
    'company_name': ['AXN pvt Ltd', 'Aviva Intl Ltd', 'Ship Incorporations',
                     'Oppo Mobiles pvt ltd', 'Nokia Inc', 'AXN pvt Ltd'],
    'company_code': ['IN225', 'IN115', 'CZ555', 'PQ795', 'RS885', 'IN215']
})

df2 = pd.DataFrame({
    'QR_score': [200.00, 420.0, 350.35, 450.0, 590.5, 250.0],
    'Identity_No': ['IN2231D', 'UK655IN', 'SL2252H', 'LK9978G', 'NG5678J',
                    'IN5531D'],
    'comp_name': ['AXN pvt Inc', 'Aviva Intl Ltd', 'Ship Inc',
                  'Oppo Mobiles pvt ltd', 'Nokia Inc', 'AXN pvt Ltd'],
    'comp_code': ['IN225', 'IN315', 'CK555', 'PRS95', 'RS885', 'IN215']
})

df1.columns = df2.columns
df2['match_acc'] = df1.eq(df2).mean(axis=1) * 100
print(df2)
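
Note that `df1.columns = df2.columns` overwrites the column names of df1 and requires both frames to have the same number of columns. If that is not desirable, one possible variation is to pass an explicit column mapping. The sketch below assumes rows are compared by position; `row_match_percent` and `pairs` are hypothetical names introduced here for illustration.

```python
import pandas as pd

def row_match_percent(left, right, column_pairs):
    """Percentage of exactly equal cells per row for the given column pairs.

    column_pairs maps a column name in `left` to its counterpart in `right`.
    Rows are compared by position, so both frames should have the same length.
    """
    left_part = left[list(column_pairs.keys())].reset_index(drop=True)
    right_part = right[list(column_pairs.values())].reset_index(drop=True)
    # Rename the left columns (on a copy) so eq lines them up with the right ones.
    left_part = left_part.rename(columns=column_pairs)
    return left_part.eq(right_part).mean(axis=1) * 100

df1 = pd.DataFrame({
    'score': [200, 450],
    'id_number': ['IN2231D', 'UK654IN'],
    'company_code': ['IN225', 'IN115'],
})
df2 = pd.DataFrame({
    'QR_score': [200.00, 420.0],
    'Identity_No': ['IN2231D', 'UK655IN'],
    'comp_code': ['IN225', 'IN315'],
})

pairs = {'score': 'QR_score', 'id_number': 'Identity_No',
         'company_code': 'comp_code'}
result = row_match_percent(df1, df2, pairs)
print(result)
```

With this approach df1 keeps its original column names, and only the listed columns take part in the comparison. The result has a fresh 0..n-1 index, so if df2 uses a different index, assigning `result.to_numpy()` avoids index alignment surprises.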

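Assuming the similarity of an individual cell should be assessed by something like fuzzywuzzy, vectorize whichever fuzzywuzzy function you need so that it applies to all cells, and create a new dataframe from the results. fuzzywuzzy only works with strings, so handle object-type columns and non-object columns separately.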

import numpy as np
import pandas as pd
from fuzzywuzzy import fuzz

# Make Column Names Match
df1.columns = df2.columns
# Select string (object) columns
t1 = df1.select_dtypes(include='object')
t2 = df2.select_dtypes(include='object')
# Apply fuzz.ratio to every cell of both frames
obj_similarity = pd.DataFrame(np.vectorize(fuzz.ratio)(t1, t2), 
                              columns=t1.columns,
                              index=t1.index)
# Compare non-object columns exactly with eq (True -> 100, False -> 0)
other_similarity = df1.select_dtypes(exclude='object').eq(
    df2.select_dtypes(exclude='object')) * 100
# Merge Similarities together and take the average per row
total_similarity = pd.concat((
    obj_similarity, other_similarity
), axis=1).mean(axis=1)

df2['match_acc'] = total_similarity

df2

   QR_score Identity_No             comp_name comp_code  match_acc
0    200.00     IN2231D           AXN pvt Inc     IN225      93.25
1    420.00     UK655IN        Aviva Intl Ltd     IN315      66.50
2    350.35     SL2252H              Ship Inc     CK555      49.00
3    450.00     LK9978G  Oppo Mobiles pvt ltd     PRS95      57.75
4    590.50     NG5678J             Nokia Inc     RS885      75.00
5    250.00     IN5531D           AXN pvt Ltd     IN215      92.75
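
If fuzzywuzzy is not available, the same cell-by-cell pattern can be tried with the standard library. The sketch below swaps in `difflib.SequenceMatcher` for `fuzz.ratio`; it is a stand-in, not the fuzzywuzzy implementation, and the scores are not guaranteed to be identical. The names `simple_ratio`, `names1`, and `names2` are hypothetical.

```python
from difflib import SequenceMatcher

import numpy as np
import pandas as pd

def simple_ratio(a, b):
    """Similarity of two strings as an integer from 0 to 100 (difflib-based)."""
    return int(round(100 * SequenceMatcher(None, a, b).ratio()))

names1 = pd.DataFrame({'comp_name': ['Nokia Inc', 'AXN pvt Ltd'],
                       'comp_code': ['RS885', 'IN225']})
names2 = pd.DataFrame({'comp_name': ['Nokia Inc', 'AXN pvt Inc'],
                       'comp_code': ['RS885', 'IN225']})

# np.vectorize calls simple_ratio on each pair of cells at the same position.
scores = pd.DataFrame(
    np.vectorize(simple_ratio)(names1.to_numpy(), names2.to_numpy()),
    columns=names1.columns,
    index=names1.index,
)
print(scores)

# Average the per-cell scores to get one similarity value per row.
print(scores.mean(axis=1))
```

Identical cells score 100, while partially matching strings such as the two AXN names get a value between 0 and 100, so the row average reflects near matches instead of treating them as complete mismatches.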

