
Sparse x dense matrix multiply unexpectedly slow with Armadillo

This is something I just came across. For some reason, multiplying a sparse matrix by a dense one in Armadillo is much slower than multiplying a dense by a sparse matrix (ie, reversing the order).

// [[Rcpp::depends(RcppArmadillo)]]
#include <RcppArmadillo.h>

// [[Rcpp::export]]
arma::sp_mat mult_sp_den_to_sp(arma::sp_mat& a, arma::mat& b)
{
    // sparse x dense -> sparse
    arma::sp_mat result(a * b);
    return result;
}

// [[Rcpp::export]]
arma::sp_mat mult_den_sp_to_sp(arma::mat& a, arma::sp_mat& b)
{
    // dense x sparse -> sparse
    arma::sp_mat result(a * b);
    return result;
}

I'm using the RcppArmadillo package to interface Arma with R; RcppArmadillo.h includes armadillo. Here are some timings in R, on a couple of reasonably large matrices:

library(Matrix)  # provides rsparsematrix

set.seed(98765)
# 10000 x 10000 sparse matrices, 99% sparse
a <- rsparsematrix(1e4, 1e4, 0.01, rand.x=function(n) rpois(n, 1) + 1)
b <- rsparsematrix(1e4, 1e4, 0.01, rand.x=function(n) rpois(n, 1) + 1)

# dense copies
a_den <- as.matrix(a)
b_den <- as.matrix(b)

system.time(mult_sp_den_to_sp(a, b_den))
#   user  system elapsed 
# 508.66    0.79  509.95 

system.time(mult_den_sp_to_sp(a_den, b))
#   user  system elapsed 
#  13.52    0.74   14.29 

So the first multiply takes about 35 times longer than the second (all times are in seconds).

Interestingly, if I simply make a temporary sparse copy of the dense matrix, performance is much improved:

// [[Rcpp::export]]
arma::sp_mat mult_sp_den_to_sp2(arma::sp_mat& a, arma::mat& b)
{
    // sparse x dense -> sparse
    // copy dense to sparse, then multiply
    arma::sp_mat temp(b);
    arma::sp_mat result(a * temp);
    return result;
}
system.time(mult_sp_den_to_sp2(a, b_den))
#   user  system elapsed 
#   5.45    0.41    5.86 

Is this expected behaviour? I'm aware that with sparse matrices, the exact way in which you do things can have big impacts on the efficiency of your code, much more so than with dense. A 35x difference in speed seems rather large though.

Sparse and dense matrices are stored in a very different way. Armadillo uses CMS (column-major storage) for dense matrices, and CSC (compressed sparse column) for sparse matrices. From Armadillo's documentation:

  • Mat, mat, cx_mat: Classes for dense matrices, with elements stored in column-major ordering (ie. column by column)
  • SpMat, sp_mat, sp_cx_mat: Classes for sparse matrices, with elements stored in compressed sparse column (CSC) format

The first thing we have to understand is how much storage space each format requires:

Given the quantities element_size (4 bytes for single precision, 8 bytes for double precision), index_size (4 bytes if using 32-bit integers, or 8 bytes if using 64-bit integers), num_rows (the number of rows of the matrix), num_cols (the number of columns of the matrix), and num_nnz (the number of nonzero elements of the matrix), the following formulae give us the storage space for each format:

storage_cms = num_rows * num_cols * element_size
storage_csc = num_nnz * element_size + num_nnz * index_size + num_cols * index_size

For more details about storage formats see Wikipedia or Netlib.

Assuming double precision and 32-bit indices, in your case that means:

storage_cms = 800MB
storage_csc = 12.04MB

So when you are multiplying a sparse x dense (or dense x sparse) matrix, you are accessing ~812MB of memory, while you only access ~24MB of memory when multiplying sparse x sparse matrices.

Note that this doesn't include the memory where you write the results, which can be a significant portion (up to ~800MB in both cases). I am not very familiar with Armadillo or with which algorithm it uses for matrix multiplication, so I cannot say exactly how it stores the intermediate results.

Whatever the algorithm, it definitely needs to access both input matrices multiple times, which explains why converting a dense matrix to sparse (which requires only one access to the 800MB of dense matrix), and then doing a sparse x sparse product (which requires accessing 24MB of memory multiple times) is more efficient than dense x sparse and sparse x dense product.

There are also all sorts of cache effects here, which would require the knowledge of the exact implementation of the algorithm and the hardware (and a lot of time) to explain properly, but above is the general idea.

As for why dense x sparse is faster than sparse x dense: it is because of the CSC storage format for sparse matrices. As noted in scipy's documentation, the CSC format is efficient for column slicing and slow for row slicing. dense x sparse multiplication algorithms need column slicing of the sparse matrix, while sparse x dense algorithms need row slicing of the sparse matrix. Note that if Armadillo used CSR instead of CSC, sparse x dense would be efficient and dense x sparse wouldn't.

I am aware that this is not a complete explanation of all the performance effects you are seeing, but it should give you a general idea of what is happening. A proper analysis would require a lot more time and effort, and would have to include the concrete implementations of the algorithms and information about the hardware on which they run.

This should be fixed in the upcoming Armadillo 8.500, which will be wrapped in RcppArmadillo 0.8.5 Real Soon Now. Specifically:

  • sparse matrix transpose is much faster
  • (sparse x dense) reimplemented as ((dense^T) x (sparse^T))^T, taking advantage of the relatively speedy (dense x sparse) code

When I tested it, the time taken dropped from ~500 seconds to about 18 seconds, which is comparable to the other timings.
