
R package Matrix: get number of non-zero entries per row / column of a sparse matrix

I have a large sparse matrix ("dgCMatrix", dimension 5e+5 x 1e+6). I need to count, for each column, how many non-zero values there are, and make a list of the names of the columns that have exactly 1 non-zero entry.

My code works for small matrices, but becomes too computationally intensive for the actual matrix I need to work on.

library(Matrix)
set.seed(0)
mat <- Matrix(matrix(rbinom(200, 1, 0.10), ncol = 20))
colnames(mat) <- letters[1:20]

entries <- colnames(mat[, nrow(mat) - colSums(mat == 0) == 1])

Any suggestion is very welcome!

Similar results are produced using the following. Please note the comments:

## `mat != 0` returns a "lgCMatrix" which is sparse
## don't try `mat == 0` as that is dense, simply because there are too many zeros
entries <- colnames(mat)[colSums(mat != 0) == 1]
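To see why the direction of the comparison matters, here is a minimal sketch on the small example matrix from the question. It only checks that `mat != 0` stays sparse and that its column sums match a dense reference; the names `nz` and `nnz_col` are introduced here for illustration.

```r
library(Matrix)

set.seed(0)
mat <- Matrix(matrix(rbinom(200, 1, 0.10), ncol = 20), sparse = TRUE)
colnames(mat) <- letters[1:20]

## comparing against "non-zero" keeps the result sparse:
## only the stored entries can be TRUE
nz <- mat != 0
class(nz)

## non-zero count per column, then the columns with exactly one
nnz_col <- colSums(nz)
entries <- colnames(mat)[nnz_col == 1]
entries
```

On the full 5e+5 x 1e+6 matrix the dense reference used in the check is not feasible, which is the whole reason to prefer `mat != 0`.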

I have a large sparse matrix ("dgCMatrix")

Let us call it dgCMat.

I need to count for each column how many non-zero values there are

xx <- diff(dgCMat@p)

and make a list of column names with only 1 non-zero entry

colnames(dgCMat)[xx == 1]
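Why this works: in a "dgCMatrix" the `p` slot holds the column pointers, a non-decreasing integer vector of length `ncol + 1` that starts at 0, so consecutive differences give the number of stored entries in each column. A small worked sketch, using a made-up 3 x 4 matrix `m` (not from the question):

```r
library(Matrix)

## hypothetical example: column "c" has no stored entries
m <- sparseMatrix(i = c(1, 3, 2, 1), j = c(1, 1, 2, 4), x = c(5, 7, 2, 9),
                  dims = c(3, 4),
                  dimnames = list(NULL, c("a", "b", "c", "d")))

m@p                    # 0 2 3 3 4
xx <- diff(m@p)        # 2 1 0 1
colnames(m)[xx == 1]   # "b" "d"
```

Note that this counts stored entries. If the matrix might carry explicitly stored zeros, calling `drop0()` on it first should remove them before counting.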

Summary

nnz: number of non-zeros

For a "dgCMatrix" dgCMat:

## nnz per column
diff(dgCMat@p)

## nnz per row
tabulate(dgCMat@i + 1)

For a "dgRMatrix" dgRMat:

## nnz per column
tabulate(dgRMat@j + 1)

## nnz per row
diff(dgRMat@p)

For a "dgTMatrix" dgTMat:

## nnz per column
tabulate(dgTMat@j + 1)

## nnz per row
tabulate(dgTMat@i + 1)
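As a sanity check, the three variants can be compared on one matrix stored in all three formats. This is only a sketch: the random 6 x 5 matrix is made up, and `nbins` is passed to `tabulate()` so that each result always has one entry per row or column, even when trailing rows or columns are empty.

```r
library(Matrix)

set.seed(1)
dgCMat <- rsparsematrix(6, 5, density = 0.3)   # column-compressed
dgRMat <- as(dgCMat, "RsparseMatrix")          # row-compressed
dgTMat <- as(dgCMat, "TsparseMatrix")          # triplet form

## nnz per column, one variant per storage class
col_C <- diff(dgCMat@p)
col_R <- tabulate(dgRMat@j + 1, nbins = ncol(dgRMat))
col_T <- tabulate(dgTMat@j + 1, nbins = ncol(dgTMat))

## nnz per row, one variant per storage class
row_C <- tabulate(dgCMat@i + 1, nbins = nrow(dgCMat))
row_R <- diff(dgRMat@p)
row_T <- tabulate(dgTMat@i + 1, nbins = nrow(dgTMat))

## all three storage formats should agree
identical(col_C, col_R) && identical(col_C, col_T)
identical(row_C, row_R) && identical(row_C, row_T)
```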

I did not read your original code when posting this answer, so I did not know that you got stuck with the use of mat == 0. Only later did I add the difference between mat == 0 and mat != 0 in your answer.

Your workaround using mat != 0 makes good use of the package's features. That same line of code should work with other sparse matrix classes, too. Mine goes straight to the internal storage, hence different versions are required for different classes.

I don't have the rep to comment on the accepted answer, but I wanted to point out that the tabulate-based functions fall apart if there are no non-zero entries in the last few columns / rows: by default tabulate() only makes bins up to the largest index it sees, so the result comes out shorter than the number of rows or columns.

This can be fixed by specifying the number of bins to tabulate:

## nnz per row
tabulate(dgCMat@i + 1, nbins=nrow(dgCMat))

## nnz per column
tabulate(dgRMat@j + 1, nbins=ncol(dgRMat))
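A minimal sketch of the failure and the fix, using a made-up 4 x 3 matrix `m` whose last two rows are entirely zero (the names `short` and `full` are just for illustration):

```r
library(Matrix)

## hypothetical example: rows 3 and 4 contain no non-zero entries
m <- sparseMatrix(i = c(1, 2, 2), j = c(1, 2, 3), x = c(1, 2, 3),
                  dims = c(4, 3))

short <- tabulate(m@i + 1)                    # 1 2      (rows 3, 4 missing)
full  <- tabulate(m@i + 1, nbins = nrow(m))   # 1 2 0 0  (one count per row)

length(short) == nrow(m)   # FALSE
length(full)  == nrow(m)   # TRUE
```

The shortened result matters as soon as it is used for indexing, for example in `rownames(m)[counts == 1]`, where the logical index would be silently recycled.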
