
Efficiently calculate cosine similarity using scikit-learn

After preprocessing and transforming (BOW, TF-IDF) the data, I need to calculate its cosine similarity with every other element of the dataset. Currently, I do this:

cs_title = [cosine_similarity(a, b) for a in tr_title for b in tr_title]
cs_abstract = [cosine_similarity(a, b) for a in tr_abstract for b in tr_abstract]
cs_mesh = [cosine_similarity(a, b) for a in pre_mesh for b in pre_mesh]
cs_pt = [cosine_similarity(a, b) for a in pre_pt for b in pre_pt]

In this example, each input variable, e.g. tr_title, is a SciPy sparse matrix. However, this code runs extremely slowly. What can I do to optimise the code so that it runs more quickly?

To improve performance you should replace the list comprehensions with vectorized code. This can be implemented easily through SciPy's pdist and squareform, as shown in the snippet below:

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from scipy.spatial.distance import pdist, squareform

titles = [
    'A New Hope',
    'The Empire Strikes Back',
    'Return of the Jedi',
    'The Phantom Menace',
    'Attack of the Clones',
    'Revenge of the Sith',
    'The Force Awakens',
    'A Star Wars Story',
    'The Last Jedi',
    ]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(titles)
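# pdist needs a dense array, hence X.toarray(); the 'cosine' metric returns
# distances in condensed form, and squareform expands them into a full
# symmetric matrix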
cs_title = squareform(pdist(X.toarray(), 'cosine'))

Demo:

In [87]: X
Out[87]: 
<9x21 sparse matrix of type '<type 'numpy.int64'>'
    with 30 stored elements in Compressed Sparse Row format>

In [88]: X.toarray()          
Out[88]: 
array([[0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0],
       [0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0],
       [1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 1, 0],
       [0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1],
       [0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0]], dtype=int64)

In [89]: vectorizer.get_feature_names()
Out[89]: 
[u'attack',
 u'awakens',
 u'back',
 u'clones',
 u'empire',
 u'force',
 u'hope',
 u'jedi',
 u'last',
 u'menace',
 u'new',
 u'of',
 u'phantom',
 u'return',
 u'revenge',
 u'sith',
 u'star',
 u'story',
 u'strikes',
 u'the',
 u'wars']

In [90]: np.set_printoptions(precision=2)

In [91]: print(cs_title)
[[ 0.    1.    1.    1.    1.    1.    1.    1.    1.  ]
 [ 1.    0.    0.75  0.71  0.75  0.75  0.71  1.    0.71]
 [ 1.    0.75  0.    0.71  0.5   0.5   0.71  1.    0.42]
 [ 1.    0.71  0.71  0.    0.71  0.71  0.67  1.    0.67]
 [ 1.    0.75  0.5   0.71  0.    0.5   0.71  1.    0.71]
 [ 1.    0.75  0.5   0.71  0.5   0.    0.71  1.    0.71]
 [ 1.    0.71  0.71  0.67  0.71  0.71  0.    1.    0.67]
 [ 1.    1.    1.    1.    1.    1.    1.    0.    1.  ]
 [ 1.    0.71  0.42  0.67  0.71  0.71  0.67  1.    0.  ]]

Notice that X.toarray().shape yields (9L, 21L) because in the toy example above there are 9 titles and 21 distinct words, whereas cs_title is a 9 by 9 array. Also note that pdist with the 'cosine' metric computes the cosine distance, i.e. 1 minus the cosine similarity, which is why the diagonal of cs_title is 0; use 1 - squareform(pdist(X.toarray(), 'cosine')) if you need the similarity itself.
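Alternatively, scikit-learn's own cosine_similarity can compute the full pairwise matrix in a single vectorized call, and it accepts the sparse matrix directly, so no X.toarray() conversion (with its memory cost) is needed. A minimal sketch, reusing the X from the demo above:

from sklearn.metrics.pairwise import cosine_similarity

# One call computes all pairwise similarities among the rows of X;
# X can stay sparse throughout.
cs_title = cosine_similarity(X)  # 9 by 9 array with 1.0 on the diagonal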

You can reduce the effort for each of the calculations by over half by taking into account two characteristics of the cosine similarity of two vectors:

  1. The cosine similarity of a vector with itself is one.
  2. The cosine similarity of vector x with vector y is the same as the cosine similarity of vector y with vector x.

Therefore, calculate only the elements above the diagonal (or only those below it) and mirror them across the diagonal.

EDIT: Here's how you could calculate it. Note especially that cs is just a dummy function taking the place of a real calculation of the similarity coefficient.

import numpy as np

title1 = 'A four word title'
title2 = 'A five word title'
title3 = 'A six word title'
title4 = 'A seven word title'

titles = [title1, title2, title3, title4]
N = len(titles)

similarity_matrix = np.zeros((N, N), dtype=float)

cs = lambda a, b: 10*a + b  # just a 'pretend' calculation of the coefficient

for m in range(N):
    similarity_matrix[m, m] = 1  # property 1: similarity of a vector with itself is 1
    for n in range(m + 1, N):
        similarity_matrix[m, n] = cs(m, n)  # compute the upper triangle only
        similarity_matrix[n, m] = similarity_matrix[m, n]  # property 2: mirror it

print(similarity_matrix)

Here's the result.

[[  1.   1.   2.   3.]
 [  1.   1.  12.  13.]
 [  2.  12.   1.  23.]
 [  3.  13.  23.   1.]]
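For real vectors the same trick is available off the shelf: pdist from the first answer already exploits both properties, since it returns only the N*(N-1)/2 upper-triangular entries in condensed form. A minimal sketch of recovering similarities from it, reusing the X from the earlier demo:

import numpy as np
from scipy.spatial.distance import pdist, squareform

condensed = 1 - pdist(X.toarray(), 'cosine')  # condensed cosine *similarities*
similarity_matrix = squareform(condensed)     # mirror into a full N x N matrix
np.fill_diagonal(similarity_matrix, 1.0)      # squareform puts 0 on the diagonal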
