Calculate cosine similarity between two columns within a groupby in a DataFrame

I have a dataframe df:

AID   VID   FID   APerc   VPerc
1     A     X     0.2     0.5
1     A     Z     0.1     0.3
1     A     Y     0.4     0.9
2     A     X     0.2     0.3
2     A     Z     0.9     0.1
1     B     Z     0.1     0.2
1     B     Y     0.8     0.3
1     B     W     0.5     0.4
1     B     X     0.6     0.3

I want to calculate the cosine similarity between the APerc and VPerc values for every (AID, VID) pair. So the result for the above should be:

AID   VID   CosSim   
1     A     0.997   
2     A     0.514    
1     B     0.925     

I know how to groupby: df.groupby(['AID','VID'])

and I know how to generate cosine similarity for the whole column:

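from sklearn.metrics.pairwise import cosine_similarity
# cosine_similarity expects 2D input, so pass each column as a single row
cosine_similarity(df['APerc'].values.reshape(1, -1), df['VPerc'].values.reshape(1, -1))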

What's the best and fastest way to do this, given that I have a really large file?

Pairwise cosine_similarity is designed for 2D arrays so you'll need to do some reshaping before and after. Instead of that, use scipy's cosine distance:

from scipy.spatial.distance import cosine
df.groupby(['AID','VID']).apply(lambda x: 1 - cosine(x['APerc'], x['VPerc']))
Out: 
AID  VID
1    A      0.997097
     B      0.924917
2    A      0.514496
dtype: float64

Timing on a df of shape (10k, 5) gives 2.87 ms for scipy and 4.08 ms for sklearn. A fair amount of that 4.08 ms is probably due to the warnings sklearn outputs, because with Alexander's version it drops to 3.31 ms. I suspect the sklearn version becomes much faster when called on a single 2D array.
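For anyone who wants to reproduce the scipy result, here is a minimal self-contained sketch that rebuilds the sample frame from the question and reshapes the output into the AID / VID / CosSim layout the question asks for. The name cos_sim and the column selection before apply are my own choices, not part of the answer above; selecting the two value columns first is only there to keep apply away from the grouping columns.

```python
import pandas as pd
from scipy.spatial.distance import cosine

# Sample data copied from the question.
df = pd.DataFrame({
    'AID':   [1, 1, 1, 2, 2, 1, 1, 1, 1],
    'VID':   list('AAAAABBBB'),
    'FID':   list('XZYXZZYWX'),
    'APerc': [0.2, 0.1, 0.4, 0.2, 0.9, 0.1, 0.8, 0.5, 0.6],
    'VPerc': [0.5, 0.3, 0.9, 0.3, 0.1, 0.2, 0.3, 0.4, 0.3],
})

# scipy's cosine() returns a distance, so similarity is 1 - distance.
cos_sim = (
    df.groupby(['AID', 'VID'])[['APerc', 'VPerc']]
      .apply(lambda g: 1 - cosine(g['APerc'], g['VPerc']))
      .rename('CosSim')      # name the resulting Series
      .reset_index()         # turn the (AID, VID) index back into columns
)
print(cos_sim)
```

The result is a regular DataFrame with one row per (AID, VID) pair.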

Not sure whether it is the fastest, but groupby.apply is usually the way to do this:

(df.groupby(['AID','VID'])
   .apply(lambda g: cosine_similarity(g['APerc'], g['VPerc'])[0][0]))

#AID  VID
#1    A      0.997097
#     B      0.924917
#2    A      0.514496
#dtype: float64

Extending @Psidom's solution: convert each Series to a NumPy array and reshape it to a single row, since cosine_similarity expects 2D input:

(df.groupby(['AID','VID'])
   .apply(lambda g: cosine_similarity(g['APerc'].values.reshape(1, -1), 
                                      g['VPerc'].values.reshape(1, -1))[0][0]))
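Since the question mentions a really large file, one more option is to skip apply entirely. This is only a sketch and not something the answers above proposed or timed: it relies on cosine similarity being sum(a*v) / sqrt(sum(a^2) * sum(v^2)), so the three sums can be computed per group with a plain groupby sum. The names terms, sums and cos_sim_fast are made up for illustration, and I have not benchmarked it against the versions above.

```python
import numpy as np
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    'AID':   [1, 1, 1, 2, 2, 1, 1, 1, 1],
    'VID':   list('AAAAABBBB'),
    'FID':   list('XZYXZZYWX'),
    'APerc': [0.2, 0.1, 0.4, 0.2, 0.9, 0.1, 0.8, 0.5, 0.6],
    'VPerc': [0.5, 0.3, 0.9, 0.3, 0.1, 0.2, 0.3, 0.4, 0.3],
})

# Row-wise terms of the cosine formula, computed once for the whole frame.
terms = pd.DataFrame({
    'AID':  df['AID'],
    'VID':  df['VID'],
    'dot':  df['APerc'] * df['VPerc'],   # contributes to the dot product
    'a_sq': df['APerc'] ** 2,            # contributes to |APerc|^2
    'v_sq': df['VPerc'] ** 2,            # contributes to |VPerc|^2
})

# One built-in aggregation per group instead of a Python lambda per group.
sums = terms.groupby(['AID', 'VID'])[['dot', 'a_sq', 'v_sq']].sum()

cos_sim_fast = (sums['dot'] / np.sqrt(sums['a_sq'] * sums['v_sq'])).rename('CosSim')
print(cos_sim_fast)
```

Note that a group in which either column is all zeros gives a zero denominator here, so the result for that group is NaN rather than an error.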
