
Sparse x dense matrix multiplication performance under-efficient

Context: I am using Eigen for an artificial neural network where the typical dimensions are around 1000 nodes per layer. So most of the operations consist of multiplying a matrix M of size ~(1000,1000) by a vector of size 1000, or by a batch of B vectors, represented as a matrix of size Bx1000.

After training a neural network, I apply pruning, a common compression technique that ends up with a sparse matrix (density of non-zero parameters between 10% and 50%).
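For reference, the conversion itself is a one-liner (a minimal sketch; the function name and the 1e-6 threshold are my own choices, and sparseView(reference, epsilon) keeps entries with |x| > reference*epsilon):

// Convert a pruned dense weight matrix to Eigen's compressed sparse format,
// dropping any near-zero weights left over by pruning (threshold assumed).
Eigen::SparseMatrix<float> compress(const Eigen::MatrixXf &W) {
  return W.sparseView(1.f, 1e-6f);
}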

Goal: I would like to use sparse matrices for compression purposes, and secondarily for performance optimization, but this is not the main goal.

Issue: I am comparing the performance of sparse and dense matrix multiplication (only the multiplication time is measured) for different batch sizes, and I observe the following (using Eigen 3.2.8, on a 64-bit MacBook Pro, without OpenMP, using standard g++):

  • when B=1 (matrix x vector), sparse matrix operations with density 10% or 30% are more efficient than dense matrix operations, which seems like the expected result: far fewer operations are performed
  • for B=32:
    • the time needed for the dense matrix operation is only ~10 times the time needed for B=1, which is nice; does this show some vectorization effect?
    • the time needed for the sparse matrix operation is 67 times the time needed for B=1, which means that it is less efficient than processing the 32 vectors independently

[Figure: MxN multiplication time (ms) for M sparse/dense, with N of size 1000xB]

[Figure: the same numbers, shown as time per vector for batches of different sizes, for sparse and dense matrices, normalized to the B=1 time. The time per vector clearly decreases for the dense matrix as the batch size increases, while it increases for the sparse matrix, showing that something is wrong.]

Code: I am using the following types for sparse and dense matrices:

typedef SparseMatrix<float> spMatFloat;
typedef Matrix<float, Dynamic, Dynamic, RowMajor> deMatRowFloat;

The operation I am benchmarking is the following:

o.noalias()=m*in.transpose();

where o is a dense matrix (1000xB), m is either a dense matrix (1000x1000) or the corresponding sparse matrix obtained with m.sparseView(), and in is a dense matrix (Bx1000).

The full code is below (times are averaged over 20 different random matrices, each multiplication being run 50 times); the measured times for B=32 and B=1 follow.

Any feedback/intuition is welcome!


batch   1   ratio   0.3 dense   0.32    sparse  0.29
batch   32  ratio   0.3 dense   2.75    sparse  15.01

#include <Eigen/Sparse>
#include <Eigen/Dense>
#include <iostream>
#include <stdlib.h>
#include <boost/timer/timer.hpp>

using namespace Eigen;
using namespace boost::timer;

typedef SparseMatrix<float> spMatFloat;
typedef Matrix<float, Dynamic, Dynamic, RowMajor> deMatRowFloat;

// Sparse benchmark: sparse (1000,1000) matrix times the transpose of a dense (B,1000) batch.
void bench_Sparse(const spMatFloat &m, const deMatRowFloat &in, deMatRowFloat &o) {
  o.noalias()=m*in.transpose();
}

// Dense benchmark: the same product with a dense matrix, for comparison.
void bench_Dense(const deMatRowFloat &m, const deMatRowFloat &in, deMatRowFloat &o) {
  o.noalias()=m*in.transpose();
}

int main(int argc, const char **argv) {
  float ratio=0.3;
  int iter=20;
  int batch=32;
  float t_dense=0;
  float t_sparse=0;

  deMatRowFloat d_o1(1000,batch);  // m*in.transpose() is (1000,batch); Eigen would silently resize otherwise
  deMatRowFloat d_o2(1000,batch);
  for(int k=0; k<iter; k++) {
    // Build a (1000,1000) matrix with ~ratio non-zero density, entries uniform in [-1,1),
    // plus a random dense batch of size (batch,1000).
    deMatRowFloat d_m=deMatRowFloat::Zero(1000,1000);
    deMatRowFloat d_b=deMatRowFloat::Random(batch,1000);
    for(int h=0;h<ratio*1000000;h++) {
      int i=rand()%1000;
      int j=rand()%1000;
      d_m(i,j)=(rand()%1000)/500.-1;
    }
    spMatFloat s_m=d_m.sparseView();
    {
      cpu_timer timer;
      for(int r=0;r<50;r++) bench_Dense(d_m,d_b,d_o1);  // 50 repetitions (r avoids shadowing the outer k)
      cpu_times const elapsed_times(timer.elapsed());
      nanosecond_type const elapsed(elapsed_times.system+elapsed_times.user);
      t_dense+=elapsed/1000000.;
    }
    {
      cpu_timer timer;
      for(int r=0;r<50;r++) bench_Sparse(s_m,d_b,d_o2);  // 50 repetitions (r avoids shadowing the outer k)
      cpu_times const elapsed_times(timer.elapsed());
      nanosecond_type const elapsed(elapsed_times.system+elapsed_times.user);
      t_sparse+=elapsed/1000000.;
    }
  }
  std::cout<<"batch\t"<<batch<<"\tratio\t"<<ratio<<"\tdense\t"<<t_dense/50/iter<<"\tsparse\t"<<t_sparse/50/iter<<std::endl;
}
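For reference, a typical build line (the Eigen include path is assumed; boost::timer requires linking against boost_timer and boost_system):

g++ -O3 -I/path/to/eigen bench.cpp -lboost_timer -lboost_system -o bench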

New results after ggael's suggestion: I tried the different possible combinations and indeed found huge differences in performance when changing the RowMajor/ColMajor storage order of M and B.

To summarize, I am interested in computing M*B where M is (1000,1000) and B is (1000,batch), and in comparing the performance of sparse and dense M as the batch size grows.

I tested 3 configurations:

  • M dense, B dense
  • M sparse, B dense
  • M sparse, B dense, but the multiplication M*B is done manually, column by column (see the sketch just after this list)
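For the third configuration, the column-by-column product looks like this (a minimal sketch; the column-major typedef and the function name are mine):

typedef Matrix<float, Dynamic, Dynamic, ColMajor> deMatColFloat;

// Emulate the sparse*dense product one column at a time: each iteration is a
// single sparse matrix-vector product. o must be pre-sized to (m.rows(), in.cols()).
void bench_Sparse_byCol(const spMatFloat &m, const deMatColFloat &in, deMatColFloat &o) {
  for(int c=0; c<in.cols(); ++c)
    o.col(c).noalias()=m*in.col(c);
}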

The results are as follows, where each number is the ratio of the time per column for B=32 to the time for B=1, for a matrix M with density 0.3:

[Table: time-per-column ratios (B=32 vs B=1) for the three configurations]

The initially reported problem was the worst case (M ColMajor, B RowMajor). For (M RowMajor, B ColMajor), there is a 5x speedup between B=32 and B=1, and the performance of the sparse matrix is almost equivalent to that of the dense matrix.

Answer (from ggael): In Eigen, for dense algebra, both matrix-vector and matrix-matrix products are highly optimized and take full advantage of vectorization. As you observed, matrix-matrix products exhibit a much higher efficiency. This is because matrix-matrix products can be further optimized by increasing the ratio between the number of arithmetic operations and memory accesses, and by exploiting memory caches.
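For a rough back-of-the-envelope (my own numbers, plugging in the (1000,1000)x(1000,B) case above): the product performs 2e6*B floating-point operations, but with ideal blocking it only needs about 1e6 + 2e3*B scalar memory accesses (M read once, plus the right-hand side and the result), so the flops-per-access ratio grows roughly linearly with B. This is consistent with B=32 costing only ~10 times the B=1 time in the dense case.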

Then, regarding sparse-dense products, there are two strategies:

  1. Process the dense right-hand side one column at a time, thus scanning the sparse matrix multiple times. For this strategy, it is better to use column-major storage for the dense matrices (right-hand side and result). In Eigen 3.2, this strategy can be emulated by scanning the columns manually, as in the column-by-column sketch above.
  2. Scan the sparse matrix only once, and process the rows of the dense right-hand side and of the result in the innermost loop. This is the default strategy in Eigen 3.2. In this case, it is better to use row-major storage for the dense matrices ( Matrix<float,Dynamic,32,RowMajor> ); see the sketch just after this list.
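A minimal sketch of the second strategy's layout (the typedef and function name are mine; the fixed 32-column, row-major type is the one quoted above):

typedef Matrix<float, Dynamic, 32, RowMajor> deMatRow32Float;

// Default Eigen 3.2 kernel: the sparse matrix is scanned once, and the rows of
// the row-major right-hand side and result are processed in the innermost loop.
void bench_Sparse_rowMajorRhs(const spMatFloat &m, const deMatRow32Float &in, deMatRow32Float &o) {
  o.noalias()=m*in;  // in is (1000,32), o is (1000,32)
}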

Finally, in either case, you could try both row-major and column-major storage for the sparse matrix, and figure out which combination of strategy and sparse storage order works best in your case.
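Switching the sparse side to row-major is a one-line change (a sketch, with an assumed typedef name):

typedef SparseMatrix<float, RowMajor> spMatRowFloat;

spMatRowFloat s_m_row = d_m.sparseView();  // same values, row-major storage; then re-run the benchmarks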
