
Sparse Matrix Vs Dense Matrix Multiplication C++ Tensorflow

I would like to write sparse matrix dense vector (SpMV) multiplication in C++ Tensorflow: y = Ax

The sparse matrix, A, is stored in CSR format. The typical sparsity of A is between 50-90%. The goal is to reach a better or similar time to that of dense matrix dense vector (DMv) multiplication.
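For concreteness, this is a minimal sketch of the CSR SpMV kernel I have in mind (the array and function names are my own, not from any particular library):

```cpp
#include <cstddef>
#include <vector>

// y = A * x, with A stored in CSR:
//   values[k]  -> k-th stored nonzero
//   col_idx[k] -> column index of that nonzero
//   row_ptr[i] -> start of row i in values/col_idx (size rows + 1)
std::vector<double> spmv_csr(const std::vector<double>& values,
                             const std::vector<int>& col_idx,
                             const std::vector<int>& row_ptr,
                             const std::vector<double>& x) {
    const std::size_t rows = row_ptr.size() - 1;
    std::vector<double> y(rows, 0.0);
    for (std::size_t i = 0; i < rows; ++i)
        for (int k = row_ptr[i]; k < row_ptr[i + 1]; ++k)
            y[i] += values[k] * x[col_idx[k]];  // indirect access through col_idx
    return y;
}
```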

Please note that I have already viewed the following posts: Q1 Q2 Q3. However, I am still wondering about the following:

  1. How does SpMV multiplication compare to DMv in terms of time? Since the sparsity is relatively high, I assume SpMV should be faster given the reduced number of operations. Is that right?
  2. What should I take into account to make SpMV match or beat DMv in terms of time? Why do people say that DMv will perform better than SpMV? Does the storage representation make a difference?
  3. Are there any recommended libraries that implement SpMV in C++, for either CPU or GPU?

This question is relevant to my other question here: (CSCC: Convolution Split Compression Calculation Algorithm for Deep Neural Network)

To answer the edited question:

  1. Unless the matrix is very sparse (<10% nonzeros on CPU, probably <1% on GPU), you will likely not benefit from the sparsity. While the number of floating-point operations is reduced, the amount of storage is at least double (column or row index + value), memory access is irregular (you have an indirection via the index for the right-hand side), it becomes far more difficult to vectorize (or to achieve coalescing on the GPU), and if you parallelize you have to deal with the fact that rows are of varying length, so a static schedule is likely to be suboptimal.
  2. Beyond the points above, yes, the storage representation matters. For example, a COO matrix stores two indices and the value, while CSR/CSC store only one but require an additional offset array, which makes them more complex to build on the fly. Especially on the GPU, storage formats matter if you want to achieve at least some coalescing. This paper looks into how storage formats affect performance on the GPU: https://onlinelibrary.wiley.com/doi/full/10.1111/cgf.13957
  3. For something generic, try Eigen, or cuSparse on the GPU. There are plenty of others that perform better for specific use cases, but this part of the question isn't clearly answerable.
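To make the first point concrete, here is the dense counterpart as a plain sketch (not a tuned kernel): its inner loop streams contiguously over A with no index loads or indirection, which is exactly the access pattern compilers auto-vectorize well and which CSR SpMV gives up.

```cpp
#include <cstddef>
#include <vector>

// y = A * x for a dense row-major matrix A (rows x cols).
// The inner loop touches A and x contiguously and predictably,
// so it vectorizes easily -- unlike the indirect CSR accesses.
std::vector<double> dmv(const std::vector<double>& A,
                        const std::vector<double>& x,
                        std::size_t rows, std::size_t cols) {
    std::vector<double> y(rows, 0.0);
    for (std::size_t i = 0; i < rows; ++i)
        for (std::size_t j = 0; j < cols; ++j)
            y[i] += A[i * cols + j] * x[j];
    return y;
}
```

At 50% sparsity this loop does twice the arithmetic of an SpMV kernel, yet it often still wins because the memory traffic per useful flop is lower and the hardware prefetcher can keep up.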

Beyond the matrix format itself, even the ordering of entries in your matrix can have a massive impact on performance, which is why the Cuthill-McKee algorithm is often used to reduce matrix bandwidth (and thereby improve cache performance).


 