Calculate cosine similarity for two columns in a group by in a dataframe
I have a dataframe df:
AID VID FID APerc VPerc
1 A X 0.2 0.5
1 A Z 0.1 0.3
1 A Y 0.4 0.9
2 A X 0.2 0.3
2 A Z 0.9 0.1
1 B Z 0.1 0.2
1 B Y 0.8 0.3
1 B W 0.5 0.4
1 B X 0.6 0.3
I want to calculate the cosine similarity of the values APerc and VPerc for all pairs of AID and VID. So the result for the above should be:
AID VID CosSim
1 A 0.997
2 A 0.514
1 B 0.925
I know how to group by, df.groupby(['AID','VID']), and I know how to generate cosine similarity for the whole column:
from sklearn.metrics.pairwise import cosine_similarity
cosine_similarity(df['APerc'], df['VPerc'])
What's the best and fastest way to do this, given that I have a really large file?
Pairwise cosine_similarity is designed for 2D arrays, so you'll need to do some reshaping before and after. Instead of that, use scipy's cosine distance:
from scipy.spatial.distance import cosine
df.groupby(['AID','VID']).apply(lambda x: 1 - cosine(x['APerc'], x['VPerc']))
Out:
AID VID
1 A 0.997097
B 0.924917
2 A 0.514496
dtype: float64
Timing on a df of shape (10k, 5) gives 2.87ms for scipy and 4.08ms for sklearn. A fair amount of that 4.08ms is probably due to the warnings it outputs, because with Alexander's version it drops down to 3.31ms. I suspect the sklearn version becomes much faster when called on a single 2D array.
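Since the question is about a really large file, it may also be worth avoiding the per-group Python call altogether. The following is only a sketch, not something taken from the answers here and not benchmarked: it relies on the definition cos(a, v) = sum(a*v) / sqrt(sum(a*a) * sum(v*v)), so the three sums can be computed with a single grouped sum. The helper name grouped_cosine and the temporary column names dot, aa and bb are made up for illustration.

```python
import numpy as np
import pandas as pd

# Sample data from the question.
df = pd.DataFrame({
    "AID": [1, 1, 1, 2, 2, 1, 1, 1, 1],
    "VID": ["A", "A", "A", "A", "A", "B", "B", "B", "B"],
    "FID": ["X", "Z", "Y", "X", "Z", "Z", "Y", "W", "X"],
    "APerc": [0.2, 0.1, 0.4, 0.2, 0.9, 0.1, 0.8, 0.5, 0.6],
    "VPerc": [0.5, 0.3, 0.9, 0.3, 0.1, 0.2, 0.3, 0.4, 0.3],
})


def grouped_cosine(frame, keys, a, b):
    """Hypothetical helper: cosine similarity of columns a and b per group."""
    # Row-wise terms of the dot product and of the two squared norms.
    terms = frame[keys].assign(
        dot=frame[a] * frame[b],
        aa=frame[a] ** 2,
        bb=frame[b] ** 2,
    )
    # One grouped sum replaces the per-group lambda.
    sums = terms.groupby(keys).sum()
    return (sums["dot"] / np.sqrt(sums["aa"] * sums["bb"])).rename("CosSim")


cos_sim = grouped_cosine(df, ["AID", "VID"], "APerc", "VPerc")
print(cos_sim)
```

On the sample data this should give the same three values as the scipy version above. Note that a group whose APerc or VPerc values are all zero would produce a division by zero (NaN), so that case needs separate handling if it can occur.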
Not sure if it is the fastest, but groupby.apply is usually the way to do this:
(df.groupby(['AID','VID'])
   .apply(lambda g: cosine_similarity(g['APerc'], g['VPerc'])[0][0]))
#AID VID
#1 A 0.997097
# B 0.924917
#2 A 0.514496
#dtype: float64
Extending @Psidom's solution: convert the series to numpy arrays and reshape them before calling cosine_similarity:
(df.groupby(['AID','VID'])
   .apply(lambda g: cosine_similarity(g['APerc'].values.reshape(1, -1),
                                      g['VPerc'].values.reshape(1, -1))[0][0]))
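To see why the reshape and the trailing [0][0] are there, here is a small sketch on one group from the question (AID=1, VID=A). cosine_similarity treats each row of its inputs as one sample, so reshape(1, -1) turns each column into a single row vector and the result is a 1x1 matrix. The variable names a, v, a_row and v_row are only for illustration.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Values of the group AID=1, VID=A from the question.
a = np.array([0.2, 0.1, 0.4])  # APerc
v = np.array([0.5, 0.3, 0.9])  # VPerc

# One sample with three features each: shape (1, 3).
a_row = a.reshape(1, -1)
v_row = v.reshape(1, -1)

# The result has shape (n_samples_X, n_samples_Y), here (1, 1).
sim = cosine_similarity(a_row, v_row)
print(sim.shape)
print(sim[0][0])  # the single similarity value for this group
```

The printed value should match the 0.997097 shown for group (1, A) in the outputs above.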