Why do I get two different performances when creating a Jaccard similarity matrix from two sparse matrices that seem to be the same?
I'm confounded by a strange performance issue when I try to create a Jaccard similarity matrix using sim2() from the text2vec package. I have a sparse matrix [210,000 x 500] for which I'd like to obtain the Jaccard similarity matrix. When I pass the matrix directly to sim2(), it runs for over 30 minutes and culminates in an error message.
This is the R script I use:
library(text2vec)
JaccSim <- sim2(my_sparse_mx, method = "jaccard", norm = "none") # doesn't work
This is the error message I get after half an hour of running the script:
Cholmod error 'problem too large' at file ../Core/cholmod_sparse.c, line 92.
However, when I create another sparse matrix by subsetting the original one, using all the rows, and run the script, it takes only 3 minutes and the Jaccard similarity matrix (which is itself a sparse matrix) is generated successfully.
spmx_1 <- Matrix(my_sparse_mx[1:210000], sparse = TRUE)
JaccSim <- sim2(spmx_1, method = "jaccard", norm = "none") #works!
This one runs successfully. What is going on here? All I'm doing is subsetting my sparse matrix into another matrix (using all rows of the original matrix) and using the second sparse matrix.
To clarify, my_sparse_mx has 210,000 rows. I created it with that many rows using the following:
my_sparse_mx <- Matrix(0, nrow = 210000, ncol = 500, sparse = TRUE)
and then filled it with 1's through some other process. Also, when I run nrow(my_sparse_mx) I still get 210,000.
I'd like to know why this is happening.
spmx_1 <- Matrix(my_sparse_mx[1:210000], sparse = TRUE)
says to take the first 210000 elements of my_sparse_mx and turn them into another matrix. Because elements are counted in column-major order, those are just the entries of the first column, so the result will have 210000 rows and 1 column.
You probably wanted
spmx_1 <- Matrix(my_sparse_mx[1:210000, ], sparse = TRUE)
with the comma.
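The difference between the two forms of indexing can be checked on a toy matrix. This is only a minimal sketch: the 6 x 4 matrix `m` and its contents are made up for illustration and stand in for my_sparse_mx.

```r
library(Matrix)

# Hypothetical stand-in for my_sparse_mx: 6 rows, 4 columns, a few 1's
m <- sparseMatrix(i = c(1, 3, 6), j = c(2, 1, 4), x = 1, dims = c(6, 4))

# Without the comma: linear indexing returns the first 6 elements as a
# plain vector, and Matrix() turns that vector into a one-column matrix
wrong <- Matrix(m[1:6], sparse = TRUE)
dim(wrong)

# With the comma: row indexing keeps every column
right <- m[1:6, , drop = FALSE]
dim(right)
```

Comparing dim(wrong) with dim(right) shows whether the subset kept all the columns of the original.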
The sim2 function calculates pairwise Jaccard similarity, which means the result matrix in your case will be 210000 x 210000. The sparsity of this resulting matrix depends on the data, and in some cases it won't be a problem. I guess that in your case it is quite dense and can't be handled by the underlying Matrix routines.
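A rough back-of-the-envelope calculation gives a feel for the sizes involved. It assumes 8 bytes per stored value and ignores the index overhead of sparse storage. The comparison with the 32-bit integer limit is included only as a plausible reason for the "problem too large" error; that link is an assumption, not something confirmed in this thread.

```r
n_rows <- 210000

# Number of cells in a 210000 x 210000 similarity matrix
n_cells <- n_rows^2

# Does the cell count exceed what a 32-bit integer can index?
exceeds_int <- n_cells > .Machine$integer.max

# Memory needed if every cell were stored as a double (8 bytes), in GB
dense_gb <- n_cells * 8 / 1e9

print(n_cells)
print(exceeds_int)
print(dense_gb)
```

If the result is even moderately dense, the number of stored values can pass the integer limit long before it reaches the full cell count.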
Your subsetting as mentioned above is not correct: you missed the comma. So you subset just the first 210000 elements.
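This also suggests why the one-column version finishes quickly. For binary data, the Jaccard similarity of two rows is nonzero exactly when they share at least one column containing a 1, so the nonzero pattern of the result matches that of the row-by-row intersection counts. The sketch below does not call sim2; it uses tcrossprod() from the Matrix package on a made-up 6 x 4 matrix `m` to count those nonzero pairs for the full matrix and for the accidental one-column subset.

```r
library(Matrix)

# Hypothetical binary matrix standing in for my_sparse_mx
m <- sparseMatrix(
  i = c(1, 1, 2, 2, 3, 4, 4, 5, 6),
  j = c(1, 2, 2, 3, 1, 3, 4, 4, 2),
  x = 1, dims = c(6, 4)
)

# Accidental subset without the comma: only the first column survives
one_col <- Matrix(m[1:6], sparse = TRUE)

# Intersection counts between every pair of rows; an entry is nonzero
# exactly where the Jaccard similarity of that pair is nonzero
count_nonzero_pairs <- function(x) sum(as.matrix(tcrossprod(x)) != 0)

nnz_full <- count_nonzero_pairs(m)
nnz_one_col <- count_nonzero_pairs(one_col)

print(c(full = nnz_full, one_col = nnz_one_col))
```

With a single column, only the rows that have a 1 in that column can pair up, so the result stays very sparse even when the number of rows is large.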