
Why cuSPARSE is much slower than cuBLAS for sparse matrix multiplication

Recently, when I used cuSPARSE and cuBLAS in CUDA Toolkit 6.5 to do sparse matrix multiplication, I found that cuSPARSE is much slower than cuBLAS in all cases!

In all my experiments I used cusparseScsrmm from cuSPARSE and cublasSgemm from cuBLAS. In the sparse matrix, half of the total elements are zero. The GPU I used is an NVIDIA Titan Black. All timings were obtained with the nvvp profiling tool provided by NVIDIA. Below are some of the results:

Experiment A:

  1. sparse matrix size: 192x2400
  2. dense matrix size: 2400x256
  3. cuSPARSE time: 1.4 ms
  4. cuBLAS time: 0.21 ms

Experiment B:

  1. sparse matrix size: 192x75
  2. dense matrix size: 75x1024
  3. cuSPARSE time: 0.27 ms
  4. cuBLAS time: 0.04 ms

So it's very odd to see the results listed above. Since cuSPARSE is designed specifically to handle sparse matrices, how can it be even slower than cuBLAS? If that is generally the case, there would be no need to use cuSPARSE at all. Could you give me any explanation for these results? Also, could you suggest any other way to speed up sparse matrix multiplication?
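
For reference, the call pattern being compared looks roughly like the sketch below. This is my own minimal reconstruction, not the original benchmark code: handle creation, the buffer names (d_csrVal, d_csrRowPtr, d_csrColInd, d_Adense, d_B, d_C) and the column-major leading dimensions are all assumptions, using the CUDA 6.5-era cuSPARSE legacy API.

    /* Sketch only: multiply a sparse A (m x k, CSR) by a dense B (k x n),
     * once with cuSPARSE and once with cuBLAS (A also kept as a dense,
     * column-major array for the cuBLAS call). No error checking. */
    #include <cublas_v2.h>
    #include <cusparse_v2.h>

    void compare_spmm_gemm(int m, int n, int k, int nnz,
                           const float *d_csrVal, const int *d_csrRowPtr,
                           const int *d_csrColInd,        /* A in CSR form         */
                           const float *d_Adense,         /* A as dense, col-major */
                           const float *d_B, float *d_C)  /* B, C col-major        */
    {
        const float alpha = 1.0f, beta = 0.0f;

        /* cuSPARSE: C = alpha * A * B + beta * C, with A given in CSR */
        cusparseHandle_t spHandle;
        cusparseMatDescr_t descrA;
        cusparseCreate(&spHandle);
        cusparseCreateMatDescr(&descrA);   /* defaults: general matrix, 0-based indices */
        cusparseScsrmm(spHandle, CUSPARSE_OPERATION_NON_TRANSPOSE,
                       m, n, k, nnz, &alpha, descrA,
                       d_csrVal, d_csrRowPtr, d_csrColInd,
                       d_B, k,             /* ldb = k (B is k x n, column-major) */
                       &beta, d_C, m);     /* ldc = m (C is m x n, column-major) */

        /* cuBLAS: C = alpha * A * B + beta * C, with A given as dense */
        cublasHandle_t blHandle;
        cublasCreate(&blHandle);
        cublasSgemm(blHandle, CUBLAS_OP_N, CUBLAS_OP_N,
                    m, n, k, &alpha,
                    d_Adense, m,           /* lda = m */
                    d_B, k,                /* ldb = k */
                    &beta, d_C, m);        /* ldc = m */

        cusparseDestroyMatDescr(descrA);
        cusparseDestroy(spHandle);
        cublasDestroy(blHandle);
    }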

I don't think you can classify a matrix with half zeros as "sparse": the timings you have found are reasonable (actually the sparse algorithm is behaving pretty well!). With only half of the elements zero you save at most a factor of two in arithmetic, which is easily eaten by the overhead of the sparse format.

Sparse algorithms are efficient only for matrices where the great majority of the elements are zeros (for example, matrices coming out of finite element problems).

This holds true for CPUs, not only for GPUs: there is a significant overhead in treating the matrix as sparse, and it only becomes convenient to use sparse algorithms when... most of the elements are zeros (typically: ten or fewer non-zeros per row, and matrices of rank thousands to hundreds of thousands, or even millions).
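
To make that concrete with the sizes from Experiment A (a back-of-envelope sketch of my own; 4-byte floats and 4-byte indices assumed): with half of the entries non-zero, the CSR representation is not even smaller than the dense array, so you pay for the indirection without saving memory or much arithmetic.

    #include <stdio.h>

    /* Rough storage comparison, dense vs CSR, using the sizes from
     * Experiment A. Assumes 4-byte floats, 4-byte indices, 50% non-zeros. */
    int main(void)
    {
        long m = 192, k = 2400;
        long nnz = m * k / 2;                      /* half of the elements  */

        long dense_bytes = m * k * 4;              /* one float per entry   */
        long csr_bytes   = nnz * 4                 /* values                */
                         + nnz * 4                 /* column indices        */
                         + (m + 1) * 4;            /* row pointers          */

        printf("dense: %ld bytes\n", dense_bytes); /* ~1.8 MB               */
        printf("CSR  : %ld bytes\n", csr_bytes);   /* ~1.8 MB as well       */
        return 0;
    }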

There are other matrix shapes that have efficient solution algorithms, which you can try if they apply to your problem, e.g. band matrices. I don't know whether they have been ported to cuBLAS, though.

About the overheads

Dense linear algebra algorithms can perform optimally because processors have been designed to solve such systems as efficiently as possible. Consider the DGEMM operation (matrix-matrix multiply): for large matrices (i.e. matrices that don't fit in any cache of the system), it is an operation that lets you use the processor at more than 95% of its theoretical peak floating-point performance. How?

  • prefetching
  • optimal cache usage
  • vectorization (SSE, AVX)
  • pipelining

In a sparse LA algorithm, only the non-zero elements and their corresponding indexes are stored in memory: memory accesses are in fact indirect. So the sparse algorithm cannot exploit the hardware at the same level of optimization: I don't know the specific figures in this context, but 10 to 20% wouldn't be strange.
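
Here is an illustration of what "indirect" means (a sketch of my own, not code from any particular library): the dense inner loop streams through contiguous memory, while the CSR inner loop must first load an index and then gather from wherever it points.

    /* Dense row-times-vector: contiguous loads, easy to prefetch and vectorize. */
    float dense_row_dot(const float *A_row, const float *x, int k)
    {
        float sum = 0.0f;
        for (int j = 0; j < k; ++j)
            sum += A_row[j] * x[j];          /* A_row[j] and x[j] are contiguous */
        return sum;
    }

    /* CSR row-times-vector: every product needs an index load first, and
     * x[colInd[j]] is a gather from an unpredictable location. */
    float csr_row_dot(const float *val, const int *colInd,
                      int rowStart, int rowEnd, const float *x)
    {
        float sum = 0.0f;
        for (int j = rowStart; j < rowEnd; ++j)
            sum += val[j] * x[colInd[j]];    /* indirect (gather) access */
        return sum;
    }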

The gain, clearly, is that operations on zeros (on non-stored elements) are simply not performed, resulting in orders of magnitude fewer operations and much less storage being needed.

There are further overheads from integer logic and conditionals, but modern CPUs are pretty good at overlapping integer and FP operations and at "speculative execution". Unfortunately, these too can prevent vectorization, and so are further overheads with respect to the dense case.

What about GPUs?

Dense LA algorithms are an optimal fit for GPUs, just as they are for CPUs: in this case you have optimal usage of:

  • coalescing
  • shared memory
  • memory access patterns

Again, the indirect access to matrix elements in a sparse LA algorithm prevents exploiting the same level of optimization.
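
As an illustration only (a simplified scalar CSR SpMV kernel of my own, not what cuSPARSE actually runs): neighbouring threads each walk their own row, so their loads from val/colInd are scattered, and the x[colInd[j]] access is a data-dependent gather, which coalesces far worse than the tiled, fully coalesced loads of a dense GEMM kernel.

    /* Simplified one-thread-per-row CSR SpMV kernel, for illustration only. */
    __global__ void csr_spmv_scalar(int m,
                                    const int   *rowPtr,
                                    const int   *colInd,
                                    const float *val,
                                    const float *x,
                                    float       *y)
    {
        int row = blockIdx.x * blockDim.x + threadIdx.x;
        if (row < m) {
            float sum = 0.0f;
            for (int j = rowPtr[row]; j < rowPtr[row + 1]; ++j)
                sum += val[j] * x[colInd[j]];   /* indirect, poorly coalesced access */
            y[row] = sum;
        }
    }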

References

I can't remember which one I used when I encountered sparse problems... I think it was PSBLAS: http://people.uniroma2.it/salvatore.filippone/psblas/

But here you will be overwhelmed by them: http://www.netlib.org/utk/people/JackDongarra/la-sw.html
