I have two dataframes and want to join them based on three fields, A, B, and C. However, A and B are numeric values and I want them to match exactly in my join/merge, but C is a string value and I want at least an 80% match (similarity). That is, if A and B have the same values in both dataframes and the value of C in the first dataframe is abcde and in the second one is abcdf, I still want to consider this record in my result. How can I implement this in Python?
You can use fuzzywuzzy:
import pandas as pd
from fuzzywuzzy import fuzz
df1=pd.DataFrame({'A':[1,3,2],'B':[2,2,3],'C':['aad','aac','aad']})
df2=pd.DataFrame({'A':[1,2,2],'B':[2,2,3],'C':['aad','aab','acd']})
mergedf1=df1.merge(df2,on=['A','B'])
mergedf1['ratio']=[fuzz.ratio(x,y) for x, y in zip(mergedf1['C_x'],mergedf1['C_y'])]
mergedf1  # shows the score for each pair; you can filter the dataframe by your own limit
Out[265]:
   A  B  C_x  C_y  ratio
0  1  2  aad  aad    100
1  2  3  aad  acd     67
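To apply the 80% cutoff from the question, the ratio column can be used as a boolean mask. This is only a minimal sketch: the frame is rebuilt by hand from the output above so the snippet runs on its own, and the names threshold and matched are illustrative, not part of the answer.

```python
import pandas as pd

# Merged frame rebuilt from the output shown above (ratio values copied from it).
mergedf1 = pd.DataFrame({
    'A': [1, 2],
    'B': [2, 3],
    'C_x': ['aad', 'aad'],
    'C_y': ['aad', 'acd'],
    'ratio': [100, 67],
})

threshold = 80  # fuzz.ratio scores are on a 0-100 scale

# Keep only the pairs whose C values are at least 80% similar.
matched = mergedf1[mergedf1['ratio'] >= threshold]
print(matched)
```

With this sample data only the first row survives, since the aad/acd pair scores 67.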
I would probably merge first on only A and B, then filter out any rows that have low similarity on the C column, so something like:
result = df1.merge(df2, on=['A', 'B'])
# assuming sim is the similarity function that you created to calculate the similarity
idx = result.apply(lambda x: sim(x['C_x'], x['C_y']) >= 0.8, axis=1)
result = result[idx]
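The answer leaves sim up to you. One possible choice, assuming a character-level score between 0 and 1 is what you want, is difflib.SequenceMatcher from the standard library. Treat this as a sketch of one option rather than the definition the answer had in mind:

```python
from difflib import SequenceMatcher

def sim(a, b):
    """Return a similarity score between 0.0 and 1.0 for two strings."""
    return SequenceMatcher(None, a, b).ratio()

# The pair from the question: 4 of the 5 characters line up.
print(sim('abcde', 'abcdf'))  # 0.8
```

A function like this returns values on a 0 to 1 scale, which is why the filter above compares against 0.8 rather than 80.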
Hope it helps!