Extraction speed in Matrix package is very slow compared to regular matrix class

This is an example of comparing row extraction from large matrices, sparse and dense, using the Matrix package versus the regular R base-matrix class.

For dense matrices, the base matrix class is far faster: about 180 times by mean or median timing, and almost 400 times by minimum timing:

library(Matrix)
library(microbenchmark)

## row extraction in dense matrices
D1<-matrix(rnorm(2000^2), 2000, 2000)
D2<-Matrix(D1)
microbenchmark(D1[1,], D2[1,])
Unit: microseconds
    expr      min        lq       mean    median       uq      max neval
 D1[1, ]   14.437   15.9205   31.72903   31.4835   46.907   75.101   100
 D2[1, ] 5730.730 5744.0130 5905.11338 5777.3570 5851.083 7447.118   100

For sparse matrices, the base matrix class again wins, by about 63 times on mean timing (about 100 times on median timing).

## row extraction in sparse matrices
S1<-matrix(1*(runif(2000^2)<0.1), 2000, 2000)
S2<-Matrix(S1, sparse = TRUE)
microbenchmark(S1[1,], S2[1,])
Unit: microseconds
    expr      min       lq       mean    median        uq      max neval
 S1[1, ]   15.225   16.417   28.15698   17.7655   42.9905   45.692   100
 S2[1, ] 1652.362 1670.507 1771.51695 1774.1180 1787.0410 5241.863   100
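
As a rough cross-check of the ratios quoted above, the slowdown can be recomputed directly from the two timing tables. The sketch below only copies the min, mean, and median columns (in microseconds) from the output; the variable names are illustrative, and the result depends noticeably on which summary statistic is used.

```r
## Timings (microseconds) copied from the microbenchmark output above
dense  <- rbind(base   = c(min = 14.437,   mean = 31.72903,   median = 31.4835),
                Matrix = c(min = 5730.730, mean = 5905.11338, median = 5777.3570))
sparse <- rbind(base   = c(min = 15.225,   mean = 28.15698,   median = 17.7655),
                Matrix = c(min = 1652.362, mean = 1771.51695, median = 1774.1180))

## Slowdown of Matrix relative to base matrix, per summary statistic
dense_ratio  <- dense["Matrix", ] / dense["base", ]
sparse_ratio <- sparse["Matrix", ] / sparse["base", ]
round(rbind(dense = dense_ratio, sparse = sparse_ratio))
```

With these numbers the dense slowdown is roughly 397 (min), 186 (mean), and 184 (median), and the sparse slowdown roughly 109, 63, and 100.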

Why the speed discrepancy, and is there a way to speed up extraction in the Matrix package?

I don't know exactly what the trouble is; possibly S4 method dispatch, whose overhead could be a large share of a call this small. I was able to get performance equivalent to matrix (which has a pretty easy job, indexing + accessing a contiguous chunk of memory) by (1) switching to a row-major format and (2) writing my own special-purpose accessor function. I don't know exactly what you want to do or whether it will be worth the trouble ...

Set up example:

set.seed(101)
S1 <- matrix(1*(runif(2000^2)<0.1), 2000, 2000)

Convert to column-major (dgCMatrix) and row-major (dgRMatrix) forms:

library(Matrix)
S2C <- Matrix(S1, sparse = TRUE)
S2R <- as(S1,"dgRMatrix")

Custom accessor:

my_row_extract <- function(m,i=1) {
    r <- numeric(ncol(m))   ## set up zero vector for results
    ## suggested by @OttToomet, handles empty rows
    inds <- seq(from=m@p[i]+1, 
                to=m@p[i+1], length.out=max(0, m@p[i+1] - m@p[i]))
    r[m@j[inds]+1] <- m@x[inds]     ## set values
    return(r)
}
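
To see why this accessor works, it may help to look at the slots of a tiny row-compressed matrix. The sketch below uses a made-up 3x3 matrix (not from the original post): `p` holds zero-based row pointers, `j` holds zero-based column indices, and `x` holds the nonzero values. Depending on the installed Matrix version, the `as(., "dgRMatrix")` coercion may emit a deprecation notice.

```r
library(Matrix)

## Toy matrix (illustrative only): row 2 is entirely zero
m <- matrix(c(0, 2, 0,
              0, 0, 0,
              1, 0, 3), nrow = 3, byrow = TRUE)
mR <- as(m, "dgRMatrix")   ## same coercion as used for S2R

mR@p   ## row pointers:        0 1 1 3
mR@j   ## zero-based columns:  1 0 2
mR@x   ## nonzero values:      2 1 3

## Rebuild row 3 by hand: its entries sit at positions p[3]+1 .. p[4] of j and x
i <- 3
inds <- mR@p[i] + seq_len(mR@p[i + 1] - mR@p[i])
row3 <- numeric(ncol(mR))
row3[mR@j[inds] + 1] <- mR@x[inds]
row3   ## 1 0 3
```

For row 2 the two neighbouring pointers are equal (`p[2] == p[3]`), so there are no entries to copy and the result stays all zeros; that is the empty-row case the `length.out` argument guards against.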

Check equality of results across methods (all TRUE):

all.equal(S2C[1,],S1[1,])
all.equal(S2C[1,],S2R[1,])
all.equal(my_row_extract(S2R,1),S2R[1,])
all.equal(my_row_extract(S2R,17),S2R[17,])

Benchmark:

library(rbenchmark)   ## provides benchmark()
benchmark(S1[1,], S2C[1,], S2R[1,], my_row_extract(S2R,1),
          columns=c("test","elapsed","relative"))
##                     test elapsed relative
## 4 my_row_extract(S2R, 1)   0.015    1.154
## 1                S1[1, ]   0.013    1.000
## 2               S2C[1, ]   0.563   43.308
## 3               S2R[1, ]   4.113  316.385

The special-purpose extractor is competitive with base matrices. S2R is super-slow, even for row extraction (surprisingly); however, ?"dgRMatrix-class" does say

Note: The column-oriented sparse classes, e.g., 'dgCMatrix', are preferred and better supported in the 'Matrix' package.
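
If several rows are needed at once, the same idea can be extended with a loop over the requested rows. This is only a sketch: `my_rows_extract` is a hypothetical helper (not part of Matrix and not from the original answer), it assumes a dgRMatrix input, and it has not been benchmarked here.

```r
library(Matrix)

## Hypothetical helper: extract several rows of a dgRMatrix as a dense matrix
my_rows_extract <- function(m, rows) {
    out <- matrix(0, nrow = length(rows), ncol = ncol(m))
    for (k in seq_along(rows)) {
        i <- rows[k]
        ## positions of row i's entries in m@j / m@x (empty for all-zero rows)
        inds <- m@p[i] + seq_len(m@p[i + 1] - m@p[i])
        out[k, m@j[inds] + 1] <- m@x[inds]
    }
    out
}

## Small illustrative check against ordinary dense indexing
m <- matrix(c(0, 2, 0,
              0, 0, 0,
              1, 0, 3), nrow = 3, byrow = TRUE)
mR <- as(m, "dgRMatrix")   ## may emit a deprecation notice in newer Matrix versions
res <- my_rows_extract(mR, c(3, 2))
all.equal(res, m[c(3, 2), , drop = FALSE])
```

Whether this pays off compared with a single `S2C[rows, ]` call will depend on how many rows are requested, so it is worth timing on the real data.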
