
Why do I get two different performances when creating a Jaccard similarity matrix using two sparse matrices that seem to be the same?

I'm confounded by a strange performance issue when I try to create a Jaccard similarity matrix using sim2() from the text2vec package. I have a sparse matrix [210,000 x 500] for which I'd like to obtain the Jaccard similarity matrix. When I pass the matrix directly to sim2(), it runs for over 30 minutes and ends with an error message.

This is the R script I use:

library(text2vec)
JaccSim <- sim2(my_sparse_mx, method = "jaccard", norm = "none")  # doesn't work

This is the error message I get after half an hour of running the script:

Cholmod error 'problem too large' at file ../Core/cholmod_sparse.c, line 92.

However, when I subset another sparse matrix from the original matrix, using all the rows, and run the script, it takes only 3 minutes and the Jaccard similarity matrix (which is itself a sparse matrix) is generated successfully.

spmx_1 <- Matrix(my_sparse_mx[1:210000], sparse = TRUE)
JaccSim <- sim2(spmx_1, method = "jaccard", norm = "none") #works!

This one runs successfully. What is going on here? All I'm doing is subsetting my sparse matrix into another matrix (using all rows of the original matrix) and using the second sparse matrix.

To clarify, my_sparse_mx has 210,000 rows. I created it with that many rows using the following:

my_sparse_mx <- Matrix(0, nrow = 210000, ncol = 500, sparse = TRUE)

and then filled it with 1's accordingly through some other process. Also, when I call nrow(my_sparse_mx) I still get 210,000.

I'd like to know why this is happening.

spmx_1 <- Matrix(my_sparse_mx[1:210000], sparse = TRUE)

says to take the first 210000 elements of my_sparse_mx (in column-major order, not the first 210000 rows) and turn them into another matrix. The result will have 210000 rows and 1 column.

You probably wanted

spmx_1 <- Matrix(my_sparse_mx[1:210000, ], sparse = TRUE)

with the comma.
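The difference between the two indexing forms is easy to check on a small matrix. The following is a minimal sketch, assuming the Matrix package is available; the 6 x 3 matrix `m` and the names `wrong` and `right` are made up for illustration and do not come from the question.

```r
library(Matrix)

# Hypothetical small stand-in for my_sparse_mx: 6 rows, 3 columns, three 1's.
m <- sparseMatrix(i = c(1, 2, 4), j = c(1, 2, 3), x = 1, dims = c(6, 3))

# No comma: vector-style indexing picks the first 6 elements in
# column-major order (here, the first column), not the first 6 rows.
wrong <- Matrix(m[1:6], sparse = TRUE)

# With the comma: selects rows 1 to 6 and keeps every column.
right <- m[1:6, ]

dim(wrong)  # 6 rows, 1 column
dim(right)  # 6 rows, 3 columns
```

Comparing dim() of the original and the subset before calling sim2() would have shown that the two matrices were not the same.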

The sim2 function calculates pairwise Jaccard similarity, which means the result matrix in your case will be 210000 x 210000. The sparsity of the resulting matrix depends on the data, and in some cases it won't be a problem. I guess in your case it is quite dense and can't be handled by the underlying Matrix routines.
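To illustrate why the result has one entry per pair of rows, here is a minimal sketch that computes Jaccard similarity by hand on a tiny binary matrix and then does the size arithmetic for 210,000 rows. It uses only the Matrix package and is not text2vec's implementation of sim2(); the example data and all variable names are assumptions for illustration.

```r
library(Matrix)

# Hypothetical binary data: 4 rows (items) x 3 columns (features).
m <- sparseMatrix(i = c(1, 1, 2, 3, 4), j = c(1, 2, 1, 3, 3),
                  x = 1, dims = c(4, 3))

# Jaccard for every pair of rows: |A and B| / |A or B|.
inter <- as.matrix(tcrossprod(m))          # shared 1's for each pair of rows
sizes <- rowSums(m)                        # number of 1's in each row
uni   <- outer(sizes, sizes, "+") - inter  # size of the union for each pair
jacc  <- inter / uni

dim(jacc)  # n x n: one entry per pair of rows

# The same shape at the scale of the question.
n_rows    <- 210000
n_pairs   <- n_rows^2                      # entries in the full result
dense_gib <- n_pairs * 8 / 2^30            # size if every entry were an 8-byte double
# Assumption: the comparison below is only a plausible reason for the
# 'problem too large' error; the thread does not confirm the exact cause.
exceeds_int <- n_pairs > .Machine$integer.max
```

If most pairs of rows share at least one nonzero column, most of those entries are nonzero and the result cannot stay sparse, which fits the guess above.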

Your subsetting as mentioned above is not correct: you missed the comma. So you subset just the first 210000 elements.
