简体   繁体   中英

How to compare each row from one dataframe against all the rows from other dataframe and calculate distance measure?

I have two different customer dataframes and I would like to match them based on Jaccard distance matrix or any other method.

df1

 Name     country            cost
    0    raj  Kazakhstan     23
    1    sam      Russia     243
    2  kanan     Belarus     2
    3    Nan         Nan     0

df2

   Name     country   DOB
0   rak  Kazakhstan   12-12-1903
1   sim      russia   03-04-1994
2   raj     Belarus   21-09-2003
3  kane     Belarus   23-12-1999

Output:

if the string comparison value is greater than >0.6, I would like to combine both the rows in the new dataframe.

Df3

    Name     country   Name  country     cost   DOB
0    raj  Kazakhstan   rak   Kazakhstan  23     12-12-1903
1    sam      Russia   sim   russia      243    03-04-1994
2  kanan     Belarus   Kane  Belarus     2      23-12-1999

I had tried doing calculating each row against each row but don't how to compare each rows against entire rows from one to other dataframe?

I would like using fuzzywuzzy

from fuzzywuzzy import process

df1['key'] = df1.sum(1)
df2['key'] = df2.sum(1)


def yoursource(x):
    if [process.extract(x, df2.key.tolist(), limit=1)][0][0][1]>60:
        return [process.extract(x, df2.key.tolist(), limit=1)][0][0][0]
    else :
        return 'notmatch'

df1['key'] = df1.key.apply(yoursource)

After that we get the match key using merge

df = df1.merge(df2, on='key', how='inner').drop('key',1)
df
  Name_x   country_x Name_y   country_y
0    raj  Kazakhstan    rak  Kazakhstan
1    sam      Russia    sim      russia
2  kanan     Belarus   kane     Belarus

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM