
Why is Boost's matrix multiplication slower than mine?

I have implemented one matrix multiplication with boost::numeric::ublas::matrix (see my full, working Boost code):

Result result = read();

boost::numeric::ublas::matrix<int> C;
C = boost::numeric::ublas::prod(result.A, result.B);

and another one with the standard algorithm (see the full standard code):

vector< vector<int> > ijkalgorithm(vector< vector<int> > A, 
                                    vector< vector<int> > B) {
    int n = A.size();

    // initialise C with 0s
    vector<int> tmp(n, 0);
    vector< vector<int> > C(n, tmp);

    for (int i = 0; i < n; i++) {
        for (int k = 0; k < n; k++) {
            for (int j = 0; j < n; j++) {
                C[i][j] += A[i][k] * B[k][j];
            }
        }
    }
    return C;
}

This is how I test the speed:

time boostImplementation.out > boostResult.txt
diff boostResult.txt correctResult.txt

time simpleImplementation.out > simpleResult.txt
diff simpleResult.txt correctResult.txt

Both programs read a hard-coded text file which contains two 2000 × 2000 matrices. Both programs were compiled with these flags:

g++ -std=c++98 -Wall -O3 -g $(PROBLEM).cpp -o $(PROBLEM).out -pedantic

I got 15 seconds for my implementation and over 4 minutes for the Boost implementation!

Edit: after compiling it with

g++ -std=c++98 -Wall -pedantic -O3 -D NDEBUG -DBOOST_UBLAS_NDEBUG library-boost.cpp -o library-boost.out

I got 28.19 seconds for the ikj algorithm and 60.99 seconds for Boost. So Boost is still considerably slower.

Why is Boost so much slower than my implementation?

The slower performance of the uBLAS version can be partly explained by its debugging features, as was pointed out by TJD.

Here's the time taken by the uBLAS version with debugging on:

real    0m19.966s
user    0m19.809s
sys     0m0.112s

Here's the time taken by the uBLAS version with debugging off (-DNDEBUG -DBOOST_UBLAS_NDEBUG compiler flags added):

real    0m7.061s
user    0m6.936s
sys     0m0.096s

So with debugging off, the uBLAS version is almost 3 times faster.

The remaining performance difference can be explained by quoting the following section of the uBLAS FAQ, "Why is uBLAS so much slower than (atlas-)BLAS?":

An important design goal of ublas is to be as general as possible.

This generality almost always comes at a cost. In particular, the prod function template can handle different types of matrices, such as sparse or triangular ones. Fortunately, uBLAS provides alternatives optimized for dense matrix multiplication, in particular axpy_prod and block_prod. Here are the results of comparing the different methods (times in seconds):

ijkalgorithm   prod   axpy_prod  block_prod
   1.335       7.061    1.330       1.278

As you can see, both axpy_prod and block_prod are somewhat faster than your implementation. Measuring just the computation time without I/O, removing unnecessary copying, and carefully choosing the block size for block_prod (I used 64) can make the difference more pronounced.

See also the uBLAS FAQ and Effective uBlas and general code optimization.

I believe your compiler doesn't optimize enough. uBLAS code makes heavy use of templates, and templates require heavy optimization. I ran your code through the MS VC 7.1 compiler in release mode for 1000×1000 matrices; it gives me

10.064 s for uBLAS

7.851 s for vector

The difference is still there, but by no means overwhelming. uBLAS's core concept is lazy evaluation, so prod(A, B) evaluates results only when needed; e.g. prod(A, B)(10,100) will execute in no time, since only that one element will actually be calculated. As such, there's actually no dedicated algorithm for whole-matrix multiplication which could be optimized (see below). But you can help the library a little: declaring

matrix<int, column_major> B;

will reduce the running time to 4.426 s, which beats your function with one hand tied. This declaration makes memory access more sequential when multiplying matrices, optimizing cache usage.

PS Having read the uBLAS documentation to the end ;), you should have found out that there are actually dedicated functions to multiply whole matrices at once: two functions, axpy_prod and opb_prod. So

opb_prod(A, B, C, true);

even on an unoptimized row_major B matrix executes in 8.091 s and is on par with your vector algorithm.

PPS There are even more optimizations:

C = block_prod<matrix<int>, 1024>(A, B);

executes in 4.4 s, no matter whether B is column_ or row_major. Consider the description: "The function block_prod is designed for large dense matrices." Choose specific tools for specific tasks!

I created a little website, Matrix-Matrix Product Experiments with uBLAS. It's about integrating a new implementation of the matrix-matrix product into uBLAS. If you already have the Boost library, it only consists of 4 additional files, so it is pretty much self-contained.

I would be interested if others could run the simple benchmarks on different machines.

