
Efficient way to remove rows from a large sparse matrix in R

I have a large sparse matrix. I want to remove rows based on the number of elements they contain. The first step is to obtain the indices of the rows that have fewer than 5 elements.

Number of rows = 4,000,000; number of columns = 250,000.

The sparse matrix looks somewhat like this:

A header  C1  C2  C3  C4  C5  ...  n
First     .   1   1   1   1   .    .
Second    .   .   1   1   .   .    .
Third     1   .   .   .   .   .    .
Fourth    1   .   .   .   1   .    .
nth       1   .   .   .   .   .    .

I could use rowSums and remove rows based on the output.

items <- c()
for(i in 1:nrow(sparse_matrix)){
  if(rowSums(sparse_matrix)[i] < 3){
    items <- append(items, i)
  }
}

However, this takes 1 hour 30 minutes to go through around 10,000 rows, which is really slow.

What would be an efficient solution to this?

First, you calculate rowSums(sparse_matrix) at every iteration of the loop, which is inefficient: each call sums every row of the matrix just to read a single value. Second, calling append(items, i) in a loop is also inefficient, because the vector is copied each time it grows.

My solution, without a loop at all:

items = which(rowSums(sparse_matrix) < 3)
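To actually drop the rows, one option is to index with the logical condition directly rather than with the negated indices. The following is only a minimal sketch on a small, hypothetical toy matrix (the data and the names rs and filtered are assumptions for illustration, not from the question):

```r
library(Matrix)

# Small toy matrix standing in for sparse_matrix (hypothetical data)
sparse_matrix <- sparseMatrix(
  i = c(1, 1, 1, 2, 3, 3, 3, 3),
  j = c(1, 2, 3, 1, 1, 2, 3, 4),
  x = 1,
  dims = c(4, 5)
)

# Compute the row sums once, outside any loop
rs <- rowSums(sparse_matrix)

# Indices of the rows to remove, as in the answer above
items <- which(rs < 3)

# Keep the complementary rows with a logical index; drop = FALSE keeps
# the result a matrix even if only one row survives
filtered <- sparse_matrix[rs >= 3, , drop = FALSE]
dim(filtered)
```

The logical form is used here because sparse_matrix[-items, ] selects no rows at all when items is empty, which is easy to overlook when no row falls below the threshold.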

Following up on @AndreyShabalin's answer with an example:

Example

Set up a matrix with 4,000,000 rows and about 1 million non-zero elements (note that nc below is 25,000 columns, not the 250,000 given in the question):

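library(Matrix)

## 4e6 rows, 2.5e4 columns, 1e6 sampled (row, column) positions
nr <- 4e6; nc <- 2.5e4; ns <- 1e6
m <- Matrix(0, nrow = nr, ncol = nc)  ## all-zero matrix
set.seed(101)
i <- sample(nr, size = ns, replace = TRUE)
j <- sample(nc, size = ns, replace = TRUE)
m[cbind(i,j)] <- 1  ## set the sampled positions to 1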
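As an aside, a test matrix like this can probably also be built directly from the index pairs with sparseMatrix() instead of assigning into an empty matrix. This is only a sketch with smaller, hypothetical dimensions so that it runs quickly; the name m_alt is an assumption and is not used elsewhere in this answer:

```r
library(Matrix)

# Smaller, hypothetical dimensions so the sketch runs quickly
nr_small <- 4e4; nc_small <- 250; ns_small <- 1e4
set.seed(101)
i_small <- sample(nr_small, size = ns_small, replace = TRUE)
j_small <- sample(nc_small, size = ns_small, replace = TRUE)

# Drop repeated (i, j) pairs first: by default sparseMatrix() adds the
# x values of repeated pairs, which would give entries larger than 1
ij <- unique(cbind(i_small, j_small))
m_alt <- sparseMatrix(i = ij[, 1], j = ij[, 2], x = 1,
                      dims = c(nr_small, nc_small))

# Every distinct pair becomes exactly one non-zero entry
nnzero(m_alt) == nrow(ij)
```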

Subset

system.time(m2 <- m[rowSums(m) > 3, ]) ## 0.03 seconds
nrow(m2)  ## 543

By avoiding (1) recomputing rowSums() and (2) growing a vector, we make this very fast.
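One caveat: rowSums() adds up values, so it equals the number of elements in a row only when every stored entry is 1, as in this example. If the matrix holds other values, counting the non-zero entries is probably what is wanted. A minimal sketch on hypothetical toy data (the names m_vals, nnz_per_row and m_kept are assumptions for illustration):

```r
library(Matrix)

# Toy matrix with non-binary values (hypothetical data)
m_vals <- sparseMatrix(
  i = c(1, 1, 2, 3, 3, 3),
  j = c(1, 2, 1, 1, 2, 3),
  x = c(5, 7, 2, 1, 1, 1),
  dims = c(3, 4)
)

# Sums of the values per row: 12, 2, 3
value_sums <- rowSums(m_vals)

# Number of non-zero entries per row: 2, 1, 3
# (m_vals != 0 is itself a sparse logical matrix)
nnz_per_row <- rowSums(m_vals != 0)

# Keep rows with at least 2 non-zero entries
m_kept <- m_vals[nnz_per_row >= 2, , drop = FALSE]
```

Here the two criteria disagree: filtering on value_sums >= 3 would keep only rows 1 and 3 by coincidence of the values, while nnz_per_row reflects the element counts the question asks about.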

library(microbenchmark)
microbenchmark(
  logical = m[rowSums(m) > 3, ],
  which = m[which(rowSums(m) > 3), ]
)

For reasons I don't understand, the solution with which() is actually slightly faster than the logical-indexing solution (median of 30 vs 35 milliseconds):

Unit: milliseconds
    expr      min       lq     mean   median       uq      max neval cld
 logical 33.13682 35.07974 45.65183 35.61339 37.09483 160.3382   100   b
   which 27.90955 29.90988 37.06513 30.38864 31.76437 153.8702   100  a 
