Scipy sparse matrices are not memory efficient in cosine similarity

I am trying to implement cosine similarity using scipy sparse matrices, because I get a memory error with normal (non-sparse) matrices. However, I noticed that when the input matrix (observations) is large, the cosine similarity result takes almost the same amount of memory (in bytes) whether it is returned as a sparse or a non-sparse matrix. Am I doing something wrong, or is there a way around this? Here's the code, where the input is 5% 1's and 95% 0's.

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
from scipy import sparse
A = np.random.rand(10000, 1000)<.05
A_sparse = sparse.csr_matrix(A)
similarities = cosine_similarity(A_sparse)

# output sparse matrices
similarities_sparse = cosine_similarity(A_sparse,dense_output=False)

print("1's percentage", np.count_nonzero(A)/np.size(A))
print('memory percentage', similarities_sparse.data.nbytes/similarities.data.nbytes)

Output of one run is:

1's percentage 0.0499615
memory percentage 0.91799018

Elaborating @hpaulj's comments into an answer:

Both of your calls to cosine_similarity return the same underlying data. That cosine similarity matrix isn't mostly zeros, so storing it in a sparse format doesn't save space.
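As a rough illustration of why a sparse format doesn't help here, the sketch below compares the full memory footprint of a CSR matrix (values plus index arrays) with the dense array it was built from. The helper name csr_total_nbytes, the 500 x 500 size and the roughly 92% fill are assumptions chosen for the example, not taken from the question. Note that the question's script compares only the .data buffers, which leaves out the CSR index arrays.

```python
import numpy as np
from scipy import sparse


def csr_total_nbytes(m):
    """Bytes used by all three CSR arrays (hypothetical helper for this sketch)."""
    return m.data.nbytes + m.indices.nbytes + m.indptr.nbytes


rng = np.random.default_rng(0)

# Stand-in for a similarity matrix: a dense array with roughly 92% nonzeros.
dense = rng.random((500, 500))
dense[dense < 0.08] = 0.0

as_csr = sparse.csr_matrix(dense)

print("dense bytes:      ", dense.nbytes)
print("CSR data bytes:   ", as_csr.data.nbytes)
print("CSR total bytes:  ", csr_total_nbytes(as_csr))
```

At this fill level the stored values alone are only slightly smaller than the dense array, and once the column indices and row pointers are counted the CSR matrix is the larger of the two.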

Input data that's mostly zeros doesn't necessarily (or even typically) yield a cosine similarity matrix that's mostly zeros. Cosine(i, j) = 0 only occurs (*) for a pair of rows (i, j) of the matrix if they have no nonzero values in any of the same columns.

(* Or if the dot product otherwise comes out to 0, for example when positive and negative terms cancel, but that's a side point here.)
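To put a number on this for the question's setup, here is a minimal sketch, assuming every entry is independently 1 with probability p. Two rows then share at least one column with probability 1 - (1 - p^2)^n, where n is the number of columns. The function names and the reduced row count (2000 instead of 10000) are choices made for this example. The measurement uses A @ A.T rather than cosine_similarity, since normalization only rescales the entries and leaves the pattern of nonzeros unchanged.

```python
import numpy as np
from scipy import sparse


def expected_nonzero_fraction(p, n_cols):
    """Chance that two independent random 0/1 rows overlap in at least one column."""
    return 1.0 - (1.0 - p * p) ** n_cols


def measured_nonzero_fraction(n_rows, n_cols, p, seed=0):
    """Fraction of stored entries in A @ A.T for a random 0/1 matrix A."""
    rng = np.random.default_rng(seed)
    A = sparse.csr_matrix((rng.random((n_rows, n_cols)) < p).astype(np.float64))
    gram = A @ A.T  # same nonzero pattern as the cosine similarity matrix
    return gram.nnz / (n_rows * n_rows)


expected = expected_nonzero_fraction(0.05, 1000)
measured = measured_nonzero_fraction(2000, 1000, 0.05)
print(f"expected {expected:.3f}, measured {measured:.3f}")
```

With p = 0.05 and 1000 columns the expected fraction is about 0.918, which lines up with the 0.918 memory ratio reported in the question.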
