
Calculate cosine similarity for two columns in a groupby in a dataframe

I have a dataframe df:

AID   VID   FID   APerc   VPerc
1     A     X     0.2     0.5
1     A     Z     0.1     0.3
1     A     Y     0.4     0.9
2     A     X     0.2     0.3
2     A     Z     0.9     0.1
1     B     Z     0.1     0.2
1     B     Y     0.8     0.3
1     B     W     0.5     0.4
1     B     X     0.6     0.3

I want to calculate the cosine similarity of the values APerc and VPerc for all pairs of AID and VID. So the result for the above should be:

AID   VID   CosSim   
1     A     0.997   
2     A     0.514    
1     B     0.925     

I know how to group by: df.groupby(['AID','VID'])

and I know how to generate cosine similarity for the whole column:

from sklearn.metrics.pairwise import cosine_similarity
cosine_similarity(df['APerc'], df['VPerc'])

What's the best and fastest way to do this, given that I have a really large file?

Pairwise cosine_similarity is designed for 2D arrays, so you'd need to do some reshaping before and after. Instead, use scipy's cosine distance:

from scipy.spatial.distance import cosine
df.groupby(['AID','VID']).apply(lambda x: 1 - cosine(x['APerc'], x['VPerc']))
Out: 
AID  VID
1    A      0.997097
     B      0.924917
2    A      0.514496
dtype: float64
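If you want the result in exactly the shape asked for in the question (an AID/VID/CosSim table rather than a MultiIndexed Series), the groupby result only needs a reset_index. A minimal runnable sketch of the scipy approach above, using the sample data from the question (the column name CosSim is taken from the question; everything else is standard pandas):

```python
import pandas as pd
from scipy.spatial.distance import cosine

# Sample data from the question
df = pd.DataFrame({
    'AID':   [1, 1, 1, 2, 2, 1, 1, 1, 1],
    'VID':   ['A', 'A', 'A', 'A', 'A', 'B', 'B', 'B', 'B'],
    'FID':   ['X', 'Z', 'Y', 'X', 'Z', 'Z', 'Y', 'W', 'X'],
    'APerc': [0.2, 0.1, 0.4, 0.2, 0.9, 0.1, 0.8, 0.5, 0.6],
    'VPerc': [0.5, 0.3, 0.9, 0.3, 0.1, 0.2, 0.3, 0.4, 0.3],
})

# scipy's cosine() returns a *distance*, so 1 - distance is the similarity;
# reset_index turns the grouped Series into the requested three-column table
result = (df.groupby(['AID', 'VID'])
            .apply(lambda g: 1 - cosine(g['APerc'], g['VPerc']))
            .reset_index(name='CosSim'))
print(result)
```

This prints one row per (AID, VID) pair with the similarities 0.997, 0.514, and 0.925 from the expected output above.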

Timing on a df of shape (10k, 5) gives 2.87 ms for scipy and 4.08 ms for sklearn. A fair amount of that 4.08 ms is probably due to the warnings it outputs, because with Alexander's version it drops to 3.31 ms. I suspect the sklearn version becomes much faster when called on a single 2D array.

Not sure if it is the fastest, but groupby.apply is usually the way to do this:

(df.groupby(['AID','VID'])
   .apply(lambda g: cosine_similarity(g['APerc'], g['VPerc'])[0][0]))

#AID  VID
#1    A      0.997097
#     B      0.924917
#2    A      0.514496
#dtype: float64

Extending @Psidom's solution: convert the series to numpy arrays and reshape them to 2D before calling cosine_similarity:

(df.groupby(['AID','VID'])
   .apply(lambda g: cosine_similarity(g['APerc'].values.reshape(1, -1), 
                                      g['VPerc'].values.reshape(1, -1))[0][0]))
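For a really large file, the Python-level apply itself can become the bottleneck. Since the cosine similarity of two vectors is just sum(a*b) / sqrt(sum(a²) * sum(b²)), the whole computation can be done with one vectorized groupby aggregation and no per-group lambda. This is a sketch not taken from the answers above (the column names ab, aa, bb are my own):

```python
import numpy as np
import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    'AID':   [1, 1, 1, 2, 2, 1, 1, 1, 1],
    'VID':   ['A', 'A', 'A', 'A', 'A', 'B', 'B', 'B', 'B'],
    'FID':   ['X', 'Z', 'Y', 'X', 'Z', 'Z', 'Y', 'W', 'X'],
    'APerc': [0.2, 0.1, 0.4, 0.2, 0.9, 0.1, 0.8, 0.5, 0.6],
    'VPerc': [0.5, 0.3, 0.9, 0.3, 0.1, 0.2, 0.3, 0.4, 0.3],
})

# Precompute the three per-row products, then sum them per group:
# cos_sim = sum(a*b) / sqrt(sum(a^2) * sum(b^2))
tmp = df.assign(ab=df['APerc'] * df['VPerc'],
                aa=df['APerc'] ** 2,
                bb=df['VPerc'] ** 2)
sums = tmp.groupby(['AID', 'VID'])[['ab', 'aa', 'bb']].sum()
cos_sim = sums['ab'] / np.sqrt(sums['aa'] * sums['bb'])
print(cos_sim)
```

Because everything stays inside pandas/numpy aggregation code, this should scale much better with the number of groups than calling a scipy or sklearn function once per group.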
