
Sparse x dense matrix multiply unexpectedly slow with Armadillo

This is something I just came across. For some reason, multiplying a dense by a sparse matrix in Armadillo is much slower than multiplying a sparse by a dense matrix (i.e., reversing the order).

// [[Rcpp::depends(RcppArmadillo)]]
#include <RcppArmadillo.h>

// [[Rcpp::export]]
arma::sp_mat mult_sp_den_to_sp(arma::sp_mat& a, arma::mat& b)
{
    // sparse x dense -> sparse
    arma::sp_mat result(a * b);
    return result;
}

// [[Rcpp::export]]
arma::sp_mat mult_den_sp_to_sp(arma::mat& a, arma::sp_mat& b)
{
    // dense x sparse -> sparse
    arma::sp_mat result(a * b);
    return result;
}

I'm using the RcppArmadillo package to interface Arma with R; RcppArmadillo.h includes armadillo. Here are some timings in R, on a couple of reasonably large matrices:

library(Matrix)  # provides rsparsematrix()

set.seed(98765)
# 10000 x 10000 sparse matrices, 99% sparse
a <- rsparsematrix(1e4, 1e4, 0.01, rand.x=function(n) rpois(n, 1) + 1)
b <- rsparsematrix(1e4, 1e4, 0.01, rand.x=function(n) rpois(n, 1) + 1)

# dense copies
a_den <- as.matrix(a)
b_den <- as.matrix(b)

system.time(mult_sp_den_to_sp(a, b_den))
#   user  system elapsed 
# 508.66    0.79  509.95 

system.time(mult_den_sp_to_sp(a_den, b))
#   user  system elapsed 
#  13.52    0.74   14.29 

So the first multiply takes about 35 times longer than the second (all times are in seconds).

Interestingly, if I simply make a temporary sparse copy of the dense matrix, performance is much improved:

// [[Rcpp::export]]
arma::sp_mat mult_sp_den_to_sp2(arma::sp_mat& a, arma::mat& b)
{
    // sparse x dense -> sparse
    // copy dense to sparse, then multiply
    arma::sp_mat temp(b);
    arma::sp_mat result(a * temp);
    return result;
}
system.time(mult_sp_den_to_sp2(a, b_den))
#   user  system elapsed 
#   5.45    0.41    5.86 

Is this expected behaviour? I'm aware that with sparse matrices, the exact way in which you do things can have big impacts on the efficiency of your code, much more so than with dense matrices. A 35x difference in speed seems rather large, though.

Sparse and dense matrices are stored in very different ways. Armadillo uses CMS (column-major storage) for dense matrices, and CSC (compressed sparse column) for sparse matrices. From Armadillo's documentation:

Mat, mat, cx_mat
Classes for dense matrices, with elements stored in column-major ordering (i.e. column by column)

SpMat, sp_mat, sp_cx_mat
Classes for sparse matrices, with elements stored in compressed sparse column (CSC) format

The first thing we have to understand is how much storage space each format requires:

Given the quantities element_size (4 bytes for single precision, 8 bytes for double precision), index_size (4 bytes if using 32-bit integers, or 8 bytes if using 64-bit integers), num_rows (the number of rows of the matrix), num_cols (the number of columns of the matrix), and num_nnz (the number of nonzero elements of the matrix), the following formulas give us the storage space for each format:

storage_cms = num_rows * num_cols * element_size
storage_csc = num_nnz * element_size + num_nnz * index_size + num_cols * index_size

For more details about storage formats, see Wikipedia or Netlib.

Assuming double precision and 32-bit indices, in your case that means:

storage_cms = 800MB
storage_csc = 12.04MB

So when you are multiplying a sparse x dense (or dense x sparse) matrix, you are accessing ~812MB of memory, while you only access ~24MB of memory when multiplying sparse x sparse matrices.

Note that this doesn't include the memory where you write the results, which can be a significant portion (up to ~800MB in both cases), but I am not very familiar with Armadillo and which algorithm it uses for matrix multiplication, so I cannot say exactly how it stores the intermediate results.

Whatever the algorithm, it definitely needs to access both input matrices multiple times, which explains why converting the dense matrix to sparse (requiring only one pass over the 800MB dense matrix) and then doing a sparse x sparse product (accessing ~24MB of memory multiple times) is more efficient than a sparse x dense or dense x sparse product.

There are also all sorts of cache effects here, which would require knowledge of the exact implementation of the algorithm and of the hardware (and a lot of time) to explain properly, but the above is the general idea.

As for why dense x sparse is faster than sparse x dense, it is because of the CSC storage format for sparse matrices. As noted in scipy's documentation, the CSC format is efficient for column slicing and slow for row slicing. dense x sparse multiplication algorithms need column slicing of the sparse matrix, while sparse x dense needs row slicing of the sparse matrix. Note that if Armadillo used CSR instead of CSC, sparse x dense would be efficient and dense x sparse wouldn't.

I am aware that this is not a complete answer covering all the performance effects you are seeing, but it should give you a general idea of what is happening. A proper analysis would require a lot more time and effort, and would have to include the concrete implementations of the algorithms and information about the hardware on which they run.

This should be fixed in the upcoming Armadillo 8.500, which will be wrapped in RcppArmadillo 0.8.5 Real Soon Now. Specifically:

  • sparse matrix transpose is much faster
  • (sparse x dense) reimplemented as ((dense^T) x (sparse^T))^T, taking advantage of the relatively speedy (dense x sparse) code

When I tested it, the time taken dropped from ~500 seconds to about 18 seconds, which is comparable to the other timings.

