R package Matrix: get number of non-zero entries per row / column of a sparse matrix
I have a large sparse matrix ("dgCMatrix", dimension 5e+5 x 1e+6). I need to count, for each column, how many non-zero values there are, and make a list of the names of the columns with only 1 non-zero entry.
My code works for small matrices, but becomes too computationally intensive for the actual matrix I need to work on.
library(Matrix)
set.seed(0)
mat <- Matrix(matrix(rbinom(200, 1, 0.10), ncol = 20))
colnames(mat) <- letters[1:20]
entries <- colnames(mat[, nrow(mat) - colSums(mat == 0) == 1])
Any suggestions are very welcome!
Similar results are produced using the following. Please note the comments in the code:
## `mat != 0` returns a "lgCMatrix" which is sparse
## don't try `mat == 0` as that is dense, simply because there are too many zeros
entries <- colnames(mat)[colSums(mat != 0) == 1]
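As a quick sanity check, here is a minimal sketch that reuses the example matrix from the question and compares the sparse approach with a dense computation (affordable only at this toy size). The names `nz`, `counts` and `dense_counts` are just illustrative, and `sparse = TRUE` is added here only to make the storage explicit.

```r
library(Matrix)

set.seed(0)
mat <- Matrix(matrix(rbinom(200, 1, 0.10), ncol = 20), sparse = TRUE)
colnames(mat) <- letters[1:20]

## logical sparse matrix marking the non-zero entries
nz <- mat != 0
is(nz, "sparseMatrix")

## per-column counts, checked against a dense computation
counts <- colSums(nz)
dense_counts <- colSums(as.matrix(mat) != 0)
all(counts == dense_counts)

## names of the columns with exactly one non-zero entry
entries <- colnames(mat)[counts == 1]
entries
```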
I have a large sparse matrix ("dgCMatrix")
Let us call it dgCMat.
I need to count for each column how many non-zero values there are
xx <- diff(dgCMat@p)
and make a list of column names with only 1 non-zero entry
colnames(dgCMat)[xx == 1]
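To see why this works: in column-compressed storage the `p` slot holds cumulative column pointers, so consecutive differences give the number of stored entries in each column. A small sketch on a hand-built matrix (the matrix `m` and its column names are made up for illustration):

```r
library(Matrix)

## 3 x 4 matrix; column "b" is empty and column "d" holds two entries
m <- sparseMatrix(i = c(1, 3, 2, 3), j = c(1, 3, 4, 4), x = c(5, 7, 1, 2),
                  dims = c(3, 4),
                  dimnames = list(NULL, c("a", "b", "c", "d")))

m@p                    # column pointers: 0 1 1 2 4
xx <- diff(m@p)
xx                     # stored entries per column: 1 0 1 2
colnames(m)[xx == 1]   # "a" "c"
```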
Summary
nnz: number of non-zeros
For a "dgCMatrix" dgCMat:
## nnz per column
diff(dgCMat@p)
## nnz per row
tabulate(dgCMat@i + 1)
For a "dgRMatrix" dgRMat:
## nnz per column
tabulate(dgRMat@j + 1)
## nnz per row
diff(dgRMat@p)
For a "dgTMatrix" dgTMat:
## nnz per column
tabulate(dgTMat@j + 1)
## nnz per row
tabulate(dgTMat@i + 1)
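One caveat worth keeping in mind: these slot-based counts tally stored entries, and a sparse matrix can hold explicitly stored zeros. The sketch below assumes that `sparseMatrix()` keeps a zero passed in `x` as a stored entry, and uses `drop0()` to remove such entries before counting. The names `stored`, `true_nnz` and `reference` are illustrative.

```r
library(Matrix)

## the second value is an explicit zero (assumed to be kept as a stored entry)
m <- sparseMatrix(i = c(1, 2, 3), j = c(1, 1, 2), x = c(4, 0, 9),
                  dims = c(3, 3))

stored    <- diff(m@p)            # counts stored entries, zero or not
true_nnz  <- diff(drop0(m)@p)     # counts after dropping stored zeros
reference <- colSums(as.matrix(m) != 0)

stored
true_nnz
all(true_nnz == reference)
```

If the matrix is known to contain no stored zeros, the extra `drop0()` call is unnecessary.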
I did not read your original code when posting this answer, so I did not know that you got stuck with the use of mat == 0. Only later did I add the difference between mat == 0 and mat != 0 to your answer.
Your workaround using mat != 0 exploits the package's features well. That same line of code should work with other sparse matrix classes, too. Mine goes straight to the internal storage, hence different versions are required for different classes.
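A minimal sketch of that portability claim, assuming a reasonably recent version of Matrix in which coercion to the virtual classes "RsparseMatrix" and "TsparseMatrix" is available. The helper `single_nz_cols` is hypothetical and exists only for this illustration.

```r
library(Matrix)

set.seed(0)
matC <- Matrix(matrix(rbinom(200, 1, 0.10), ncol = 20), sparse = TRUE)
colnames(matC) <- letters[1:20]

## the same data in row-compressed and triplet storage
matR <- as(matC, "RsparseMatrix")
matT <- as(matC, "TsparseMatrix")

## hypothetical helper: column names with exactly one non-zero entry
single_nz_cols <- function(m) colnames(m)[colSums(m != 0) == 1]

identical(single_nz_cols(matC), single_nz_cols(matR))
identical(single_nz_cols(matC), single_nz_cols(matT))
```

The helper never touches the `@p`, `@i` or `@j` slots, which is why a single version can serve all three classes.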
I don't have the rep to comment on the accepted answer, but I wanted to point out that the tabulate-based functions fall apart if there are no non-zero entries in the last few columns / rows: the result is then shorter than the number of columns / rows.
This can be fixed by specifying the number of bins to tabulate:
## nnz per row
tabulate(dgCMat@i + 1, nbins=nrow(dgCMat))
## nnz per column
tabulate(dgRMat@j + 1, nbins=ncol(dgRMat))
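A small illustration of the problem and the fix, using a made-up 5 x 3 matrix whose last two rows are empty (the names `short` and `full` are illustrative):

```r
library(Matrix)

## 5 x 3 matrix whose last two rows hold no entries
m <- sparseMatrix(i = c(1, 2, 3, 3), j = c(1, 2, 2, 3), x = c(1, 1, 1, 1),
                  dims = c(5, 3))

short <- tabulate(m@i + 1)                    # stops at the largest row index seen
full  <- tabulate(m@i + 1, nbins = nrow(m))   # one count per row

length(short)   # 3
length(full)    # 5
full            # 1 1 2 0 0
```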