Why do I get two different performance results when creating a Jaccard similarity matrix from two seemingly identical sparse matrices?

Question

I'm confounded by a strange performance issue when I try to create a Jaccard similarity matrix using sim2() from the text2vec package. I have a sparse matrix [210,000 x 500] for which I'd like to obtain the Jaccard similarity matrix. When I pass the matrix directly to sim2(), it runs for over 30 minutes and culminates in an error message.

This is the R script I use:

library(text2vec)
JaccSim <- sim2(my_sparse_mx, method = "jaccard", norm = "none")  # doesn't work

This is the error message I get after half an hour of running the script:

Cholmod error 'problem too large' at file ../Core/cholmod_sparse.c, line 92.

However, when I subset another sparse matrix from the original matrix, using all the rows, and run the script, it takes only 3 minutes and the Jaccard similarity matrix (which is itself a sparse matrix) is generated successfully.

spmx_1 <- Matrix(my_sparse_mx[1:210000], sparse = TRUE)
JaccSim <- sim2(spmx_1, method = "jaccard", norm = "none") #works!

This one runs successfully. What is going on here? All I'm doing is subsetting my sparse matrix into another matrix (using all rows of the original matrix) and using the second sparse matrix.

To clarify, my_sparse_mx has 210,000 rows. I created it with that many rows using the following:

my_sparse_mx <- Matrix(0, nrow = 210000, ncol = 500, sparse = TRUE)

and then filled it with 1s through a separate process. Also, when I call nrow(my_sparse_mx) I still get 210,000.

I'd like to know why this is happening.

Answer 1

spmx_1 <- Matrix(my_sparse_mx[1:210000], sparse = TRUE)

says to take the first 210000 elements of my_sparse_mx and turn them into another matrix. Since linear indexing in R runs down the columns, those elements are exactly the first column. The result will have 210000 rows and 1 column.

You probably wanted

spmx_1 <- Matrix(my_sparse_mx[1:210000, ], sparse = TRUE)

with the comma.
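The difference the comma makes can be checked on a toy matrix. The sketch below assumes only the Matrix package; the 6 x 3 matrix `m` and the names `no_comma` and `with_comma` are made up for illustration and stand in for my_sparse_mx and spmx_1.

```r
library(Matrix)

# Toy stand-in for my_sparse_mx: 6 rows, 3 columns, a few 1s.
m <- sparseMatrix(i = c(1, 2, 4, 6), j = c(1, 2, 3, 1), x = 1, dims = c(6, 3))

# No comma: vector-style indexing picks the first 6 elements in
# column-major order, which here is column 1 only.
no_comma <- Matrix(m[1:6], sparse = TRUE)

# With comma: rows 1 to 6, all columns.
with_comma <- Matrix(m[1:6, ], sparse = TRUE)

dim(no_comma)    # 6 1
dim(with_comma)  # 6 3
```

Comparing dim() rather than nrow() alone would have exposed the problem, since both matrices report the same number of rows.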

Answer 2

The sim2 function calculates pairwise Jaccard similarity, which means the result matrix in your case will be 210,000 x 210,000. The sparsity of the resulting matrix depends on the data, and in some cases it won't be a problem. I guess that in your case it is quite dense and can't be handled by the underlying Matrix routines.

Your subsetting, as mentioned above, is not correct: you missed the comma. So you subset just the first 210,000 elements.
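To make the size argument concrete, here is a rough sketch. The first part computes pairwise Jaccard by hand on a made-up 4 x 3 binary matrix to show that the result has one row and one column per input row; it is an illustration only, not how sim2 is implemented. The second part is plain arithmetic for 210,000 rows. Linking the count of entries to the 32-bit integer limit is an assumption about why the CHOLMOD error appears, not something stated in the error message.

```r
library(Matrix)

# Hypothetical toy data: 4 rows (sets) over 3 columns (items).
# Rows: {1,2}, {2}, {3}, {3}
m <- sparseMatrix(i = c(1, 1, 2, 3, 4), j = c(1, 2, 2, 3, 3), x = 1, dims = c(4, 3))

# Pairwise Jaccard by hand: intersection size / union size for each row pair.
inter <- as.matrix(tcrossprod(m))           # shared items per pair of rows
sizes <- rowSums(m)                         # items per row
uni   <- outer(sizes, sizes, "+") - inter   # union size per pair
jacc  <- inter / uni

dim(jacc)   # 4 4: one row and one column per input row

# Scale of the full problem with 210,000 rows.
n_rows   <- 210000
n_pairs  <- n_rows^2                        # 4.41e10 entries if fully dense
over_int <- n_pairs > .Machine$integer.max  # TRUE: beyond 32-bit integer range
dense_gb <- n_pairs * 8 / 1e9               # 352.8 GB at 8 bytes per double
```

Whether the real result is manageable depends on how many of those pairs share at least one column; with only 500 columns, many rows are likely to overlap.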

Answer 1
1 2017-06-23 14:50:37

Answer 2
1 accepted 2017-06-23 14:51:18