I am looking to find the percentage difference between two dataframes. I have tried using fuzzywuzzy, but I am not getting the expected output.
Suppose I have 2 dataframes with 3 columns each; I want to find the match percentage between these 2 dataframes.
df1
score  id_number  company_name          company_code
200    IN2231D    AXN pvt Ltd           IN225
450    UK654IN    Aviva Intl Ltd        IN115
650    SL1432H    Ship Incorporations   CZ555
350    LK0678G    Oppo Mobiles pvt ltd  PQ795
590    NG5678J    Nokia Inc             RS885
250    IN2231D    AXN pvt Ltd           IN215
df2
QR_score  Identity_No  comp_name             comp_code  match_acc
200.00    IN2231D      AXN pvt Inc           IN225
420.0     UK655IN      Aviva Intl Ltd        IN315
350.35    SL2252H      Ship Inc              CK555
450.0     LK9978G      Oppo Mobiles pvt ltd  PRS95
590.5     NG5678J      Nokia Inc             RS885
250.0     IN5531D      AXN pvt Ltd           IN215
Code I am using:
df1 = df[['score','id_number','company_code']]
df2 = df[['QR_score','Identity_No','comp_code']]
for idx, row1 in df1.iterrows():
    for idx2, row2 in df2.iterrows():
        df2['match_acc'] =
For example, if the first rows of the two dataframes match by 75%, that value should be listed in the df2['match_acc'] column, and the same should be done for each row.
IIUC, rename the columns to match, then use eq + mean on axis 1:
df1.columns = df2.columns
df2['match_acc'] = df1.eq(df2).mean(axis=1) * 100
df2:
QR_score Identity_No comp_name comp_code match_acc
0 200.00 IN2231D AXN pvt Inc IN225 75.0
1 420.00 UK655IN Aviva Intl Ltd IN315 25.0
2 350.35 SL2252H Ship Inc CK555 0.0
3 450.00 LK9978G Oppo Mobiles pvt ltd PRS95 25.0
4 590.50 NG5678J Nokia Inc RS885 75.0
5 250.00 IN5531D AXN pvt Ltd IN215 75.0
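If only some columns should be compared, as in the question's code, the same eq + mean idea can be applied through an explicit column mapping instead of overwriting df1.columns. The following is a minimal sketch, not the only way to do it: the column_map dictionary and the three-row sample frames are assumptions made for illustration.

```python
import pandas as pd

# Small sample frames (illustrative subset of the question's data)
df1 = pd.DataFrame({
    'score': [200, 450, 590],
    'id_number': ['IN2231D', 'UK654IN', 'NG5678J'],
    'company_code': ['IN225', 'IN115', 'RS885'],
})
df2 = pd.DataFrame({
    'QR_score': [200.0, 420.0, 590.5],
    'Identity_No': ['IN2231D', 'UK655IN', 'NG5678J'],
    'comp_code': ['IN225', 'IN315', 'RS885'],
})

# Hypothetical mapping from df1 column names to their df2 counterparts
column_map = {
    'score': 'QR_score',
    'id_number': 'Identity_No',
    'company_code': 'comp_code',
}
compare_cols = list(column_map.values())

# rename returns a copy, so df1 keeps its original column names
left = df1.rename(columns=column_map)[compare_cols]
right = df2[compare_cols]

# Boolean frame: True where the cells are exactly equal
matches = left.eq(right)

# The row-wise mean of booleans is the fraction of matching cells
match_acc = matches.mean(axis=1) * 100
print(match_acc.round(2))
```

Because the comparison is restricted to compare_cols, an existing match_acc column in df2 does not interfere with the result.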
Complete Working Example
import pandas as pd
df1 = pd.DataFrame({
    'score': [200, 450, 650, 350, 590, 250],
    'id_number': ['IN2231D', 'UK654IN', 'SL1432H', 'LK0678G', 'NG5678J',
                  'IN2231D'],
    'company_name': ['AXN pvt Ltd', 'Aviva Intl Ltd', 'Ship Incorporations',
                     'Oppo Mobiles pvt ltd', 'Nokia Inc', 'AXN pvt Ltd'],
    'company_code': ['IN225', 'IN115', 'CZ555', 'PQ795', 'RS885', 'IN215']
})
df2 = pd.DataFrame({
    'QR_score': [200.00, 420.0, 350.35, 450.0, 590.5, 250.0],
    'Identity_No': ['IN2231D', 'UK655IN', 'SL2252H', 'LK9978G', 'NG5678J',
                    'IN5531D'],
    'comp_name': ['AXN pvt Inc', 'Aviva Intl Ltd', 'Ship Inc',
                  'Oppo Mobiles pvt ltd', 'Nokia Inc', 'AXN pvt Ltd'],
    'comp_code': ['IN225', 'IN315', 'CK555', 'PRS95', 'RS885', 'IN215']
})
df1.columns = df2.columns
df2['match_acc'] = df1.eq(df2).mean(axis=1) * 100
print(df2)
Assuming cell-by-cell similarity should instead be assessed by something like fuzzywuzzy, vectorize whichever fuzzywuzzy function you want so that it applies to all cells, and create a new dataframe from the results. fuzzywuzzy only works with strings, so handle object-type columns and non-object columns separately.
import numpy as np
import pandas as pd
from fuzzywuzzy import fuzz
# Make column names match (df2 must not contain match_acc yet, or the lengths differ)
df1.columns = df2.columns
# Select string (object) columns
t1 = df1.select_dtypes(include='object')
t2 = df2.select_dtypes(include='object')
# Apply fuzz.ratio to every cell of both frames
obj_similarity = pd.DataFrame(np.vectorize(fuzz.ratio)(t1, t2),
                              columns=t1.columns,
                              index=t1.index)
# Use non-object similarity with eq
other_similarity = df1.select_dtypes(exclude='object').eq(
    df2.select_dtypes(exclude='object')) * 100
# Merge Similarities together and take the average per row
total_similarity = pd.concat((
    obj_similarity, other_similarity
), axis=1).mean(axis=1)
df2['match_acc'] = total_similarity
df2:
QR_score Identity_No comp_name comp_code match_acc
0 200.00 IN2231D AXN pvt Inc IN225 93.25
1 420.00 UK655IN Aviva Intl Ltd IN315 66.50
2 350.35 SL2252H Ship Inc CK555 49.00
3 450.00 LK9978G Oppo Mobiles pvt ltd PRS95 57.75
4 590.50 NG5678J Nokia Inc RS885 75.00
5 250.00 IN5531D AXN pvt Ltd IN215 92.75
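If fuzzywuzzy is not installed, the same vectorize-per-cell pattern can be tried with the standard library. The sketch below swaps in difflib.SequenceMatcher as a stand-in for fuzz.ratio; the ratio helper and the two-row sample frames are assumptions for illustration, and its scores are unrounded floats, so they will not necessarily equal what fuzzywuzzy reports.

```python
from difflib import SequenceMatcher

import numpy as np
import pandas as pd


def ratio(a, b):
    """Similarity of two strings on a 0-100 scale (stand-in for fuzz.ratio)."""
    return 100 * SequenceMatcher(None, a, b).ratio()


# String-only sample frames with matching column names
t1 = pd.DataFrame({
    'comp_name': ['AXN pvt Ltd', 'Nokia Inc'],
    'comp_code': ['IN225', 'RS885'],
})
t2 = pd.DataFrame({
    'comp_name': ['AXN pvt Inc', 'Nokia Inc'],
    'comp_code': ['IN225', 'RS885'],
})

# Apply ratio to every pair of cells in the same position
similarity = pd.DataFrame(
    np.vectorize(ratio, otypes=[float])(t1.to_numpy(), t2.to_numpy()),
    columns=t1.columns,
    index=t1.index,
)

# Average the per-cell scores across each row
row_similarity = similarity.mean(axis=1)
print(similarity)
print(row_similarity)
```

Identical strings score 100, and partially matching strings score somewhere below that, so the row average plays the same role as match_acc above.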