
Why is Boost's matrix multiplication slower than mine?

I have implemented one matrix multiplication with boost::numeric::ublas::matrix (see my full, working Boost code):

Result result = read();

boost::numeric::ublas::matrix<int> C;
C = boost::numeric::ublas::prod(result.A, result.B);

and another one with the standard algorithm (see the full standard code):

vector< vector<int> > ijkalgorithm(vector< vector<int> > A, 
                                    vector< vector<int> > B) {
    int n = A.size();

    // initialise C with 0s
    vector<int> tmp(n, 0);
    vector< vector<int> > C(n, tmp);

    for (int i = 0; i < n; i++) {
        for (int k = 0; k < n; k++) {
            for (int j = 0; j < n; j++) {
                C[i][j] += A[i][k] * B[k][j];
            }
        }
    }
    return C;
}

This is how I test the speed:

time boostImplementation.out > boostResult.txt
diff boostResult.txt correctResult.txt

time simpleImplementation.out > simpleResult.txt
diff simpleResult.txt correctResult.txt

Both programs read a hard-coded text file which contains two 2000 × 2000 matrices. Both programs were compiled with these flags:

g++ -std=c++98 -Wall -O3 -g $(PROBLEM).cpp -o $(PROBLEM).out -pedantic

I got 15 seconds for my implementation and over 4 minutes for the Boost implementation!

Edit: after compiling it with

g++ -std=c++98 -Wall -pedantic -O3 -D NDEBUG -DBOOST_UBLAS_NDEBUG library-boost.cpp -o library-boost.out

I got 28.19 seconds for the ikj algorithm and 60.99 seconds for Boost. So Boost is still considerably slower.

Why is Boost so much slower than my implementation?

The slower performance of the uBLAS version can be partly explained by its debugging features, as was pointed out by TJD.

Here's the time taken by the uBLAS version with debugging on:

real    0m19.966s
user    0m19.809s
sys     0m0.112s

Here's the time taken by the uBLAS version with debugging off (-DNDEBUG -DBOOST_UBLAS_NDEBUG compiler flags added):

real    0m7.061s
user    0m6.936s
sys     0m0.096s

So with debugging off, the uBLAS version is almost 3 times faster.

The remaining performance difference can be explained by quoting the following section of the uBLAS FAQ, "Why is uBLAS so much slower than (atlas-)BLAS?":

An important design goal of ublas is to be as general as possible.

This generality almost always comes at a cost. In particular, the prod function template can handle different types of matrices, such as sparse or triangular ones. Fortunately, uBLAS provides alternatives optimized for dense matrix multiplication, in particular axpy_prod and block_prod. Here are the results of comparing the different methods (times in seconds):

ijkalgorithm   prod   axpy_prod  block_prod
   1.335       7.061    1.330       1.278

As you can see, both axpy_prod and block_prod are somewhat faster than your implementation. Measuring just the computation time without I/O, removing unnecessary copying, and carefully choosing the block size for block_prod (I used 64) can make the difference more pronounced.

See also the uBLAS FAQ and Effective uBlas and general code optimization.

I believe your compiler doesn't optimize enough. uBLAS code makes heavy use of templates, and templates require heavy optimization. I ran your code through the MS VC 7.1 compiler in release mode for 1000×1000 matrices; it gives me

10.064 s for uBLAS

7.851 s for vector

The difference is still there, but by no means overwhelming. uBLAS's core concept is lazy evaluation, so prod(A, B) evaluates results only when needed; e.g. prod(A, B)(10,100) will execute in no time, since only that one element will actually be calculated. As such, there's actually no dedicated algorithm for whole-matrix multiplication which could be optimized (see below). But you can help the library a little: declaring

matrix<int, column_major> B;

will reduce the running time to 4.426 s, which beats your function with one hand tied. This declaration makes memory access more sequential when multiplying matrices, optimizing cache usage.

PS Having read the uBLAS documentation to the end ;), you should have found out that there are actually dedicated functions to multiply whole matrices at once: two functions, axpy_prod and opb_prod. So

opb_prod(A, B, C, true);

even on an unoptimized row_major B matrix executes in 8.091 s and is on par with your vector algorithm.

PPS There are even more optimizations:

C = block_prod<matrix<int>, 1024>(A, B);

executes in 4.4 s, no matter whether B is column_ or row_major. Consider the description: "The function block_prod is designed for large dense matrices." Choose specific tools for specific tasks!

I created a little website, Matrix-Matrix Product Experiments with uBLAS. It's about integrating a new implementation of the matrix-matrix product into uBLAS. If you already have the Boost library, it only consists of 4 additional files, so it is pretty much self-contained.

I would be interested if others could run the simple benchmarks on different machines.

