
Calculate cosine similarity for two columns in a groupby in a dataframe

I have a dataframe df:

AID   VID   FID   APerc   VPerc
1     A     X     0.2     0.5
1     A     Z     0.1     0.3
1     A     Y     0.4     0.9
2     A     X     0.2     0.3
2     A     Z     0.9     0.1
1     B     Z     0.1     0.2
1     B     Y     0.8     0.3
1     B     W     0.5     0.4
1     B     X     0.6     0.3

I want to calculate the cosine similarity of the values APerc and VPerc for all pairs of AID and VID. So the result for the above should be:

AID   VID   CosSim   
1     A     0.997   
2     A     0.514    
1     B     0.925     

I know how to group by: df.groupby(['AID','VID'])

and I know how to generate cosine similarity for the whole column:

from sklearn.metrics.pairwise import cosine_similarity
cosine_similarity(df['APerc'], df['VPerc'])

What's the best and fastest way to do this, given that I have a really large file?

Pairwise cosine_similarity is designed for 2D arrays, so you'd need to do some reshaping before and after. Instead, use scipy's cosine distance:

from scipy.spatial.distance import cosine
df.groupby(['AID','VID']).apply(lambda x: 1 - cosine(x['APerc'], x['VPerc']))
Out: 
AID  VID
1    A      0.997097
     B      0.924917
2    A      0.514496
dtype: float64
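If you want the result in exactly the shape asked for in the question (an AID/VID/CosSim table rather than a MultiIndexed Series), the groupby result only needs a reset_index. A minimal runnable sketch of the scipy approach above, using the sample data from the question (the column name CosSim is taken from the question; everything else is standard pandas):

```python
import pandas as pd
from scipy.spatial.distance import cosine

# Sample data from the question
df = pd.DataFrame({
    'AID':   [1, 1, 1, 2, 2, 1, 1, 1, 1],
    'VID':   ['A', 'A', 'A', 'A', 'A', 'B', 'B', 'B', 'B'],
    'FID':   ['X', 'Z', 'Y', 'X', 'Z', 'Z', 'Y', 'W', 'X'],
    'APerc': [0.2, 0.1, 0.4, 0.2, 0.9, 0.1, 0.8, 0.5, 0.6],
    'VPerc': [0.5, 0.3, 0.9, 0.3, 0.1, 0.2, 0.3, 0.4, 0.3],
})

# scipy's cosine() returns a *distance*, so 1 - distance is the similarity;
# reset_index turns the grouped Series into the requested three-column table
result = (df.groupby(['AID', 'VID'])
            .apply(lambda g: 1 - cosine(g['APerc'], g['VPerc']))
            .reset_index(name='CosSim'))
print(result)
```

This prints one row per (AID, VID) pair with the similarities 0.997, 0.514, and 0.925 from the expected output above.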

Timing on a df of shape (10k, 5) gives 2.87 ms for scipy and 4.08 ms for sklearn. A fair amount of that 4.08 ms is probably due to the warnings it outputs, because with Alexander's version it drops to 3.31 ms. I suspect the sklearn version becomes much faster when called on a single 2D array.

Not sure if it is the fastest, but groupby.apply is usually the way to do this:

(df.groupby(['AID','VID'])
   .apply(lambda g: cosine_similarity(g['APerc'], g['VPerc'])[0][0]))

#AID  VID
#1    A      0.997097
#     B      0.924917
#2    A      0.514496
#dtype: float64

Extending @Psidom's solution: convert the series to numpy arrays and reshape them to 2D before calling cosine_similarity:

(df.groupby(['AID','VID'])
   .apply(lambda g: cosine_similarity(g['APerc'].values.reshape(1, -1), 
                                      g['VPerc'].values.reshape(1, -1))[0][0]))
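For a really large file, the Python-level apply itself can become the bottleneck. Since the cosine similarity of two vectors is just sum(a*b) / sqrt(sum(a²) * sum(b²)), the whole computation can be done with one vectorized groupby aggregation and no per-group lambda. This is a sketch not taken from the answers above (the column names ab, aa, bb are my own):

```python
import numpy as np
import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    'AID':   [1, 1, 1, 2, 2, 1, 1, 1, 1],
    'VID':   ['A', 'A', 'A', 'A', 'A', 'B', 'B', 'B', 'B'],
    'FID':   ['X', 'Z', 'Y', 'X', 'Z', 'Z', 'Y', 'W', 'X'],
    'APerc': [0.2, 0.1, 0.4, 0.2, 0.9, 0.1, 0.8, 0.5, 0.6],
    'VPerc': [0.5, 0.3, 0.9, 0.3, 0.1, 0.2, 0.3, 0.4, 0.3],
})

# Precompute the three per-row products, then sum them per group:
# cos_sim = sum(a*b) / sqrt(sum(a^2) * sum(b^2))
tmp = df.assign(ab=df['APerc'] * df['VPerc'],
                aa=df['APerc'] ** 2,
                bb=df['VPerc'] ** 2)
sums = tmp.groupby(['AID', 'VID'])[['ab', 'aa', 'bb']].sum()
cos_sim = sums['ab'] / np.sqrt(sums['aa'] * sums['bb'])
print(cos_sim)
```

Because everything stays inside pandas/numpy aggregation code, this should scale much better with the number of groups than calling a scipy or sklearn function once per group.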
