简体   繁体   中英

Eigen + MKL slower than Matlab for matrix multiplication

I am doing a lot of matrix multiplications in a C++ program and I use Eigen (3.3.5) linked with Intel's MKL (2018.3.222). I use the sequential version of the MKL and OpenMP is disabled. The problem is that it is slower than Matlab.

Some example code:

#define NDEBUG
#define EIGEN_USE_MKL_ALL

#include <iostream>
#include <chrono>
#include <Core>

using namespace Eigen;
using namespace std;

int main(){
    MatrixXd jac = 100*MatrixXd::Random(10*1228, 2850);
    MatrixXd res = MatrixXd::Zero(2850, 2850);

    for (int i=0; i<10; i++){
        auto begin = chrono::high_resolution_clock::now();
        res.noalias() = jac.transpose()*jac;
        auto end = chrono::high_resolution_clock::now();

        cout<<"time: "<<chrono::duration_cast<chrono::milliseconds>(end-begin).count() <<endl;
    }

    return 0;
}

It reports about 8 seconds on average. Compiled with -O3 and no debug symbols on Ubuntu 16.04 with g++ 6.4.

The Matlab code:

m=100*(-1+2*rand(10*1228, 2850));
res = zeros(2850, 2850);
tic; res=m'*m; toc

It reports ~4 seconds, which is two times faster. I used Matlab R2017a on the same system with maxNumCompThreads(1). Matlab uses MKL 11.3.

Without MKL and using only Eigen, it takes about 18s. What can I do to bring the C++ running time down to the same value as Matlab's? Thank you.

Later Edit: As @Qubit suggested, Matlab recognises that I am trying to multiply a matrix with its transpose and does some 'hidden' optimization. When I multiplied two different matrices in Matlab, the time went up to those 8 seconds. So, now the problem becomes: how can I tell Eigen that this matrix product is 'special' and could be optimized further?

Later Edit 2: I tried doing it like this:

MatrixXd jac = 100*MatrixXd::Random(10*1228, 2850);
MatrixXd res = MatrixXd::Zero(2850, 2850);

auto begin = chrono::high_resolution_clock::now();
res.selfadjointView<Lower>().rankUpdate(jac.transpose(), 1);
res.triangularView<Upper>() = res.transpose();
auto end = chrono::high_resolution_clock::now();

MatrixXd oldSchool = jac.transpose()*jac;
if (oldSchool.isApprox(res)){
    cout<<"same result!"<<endl;
}
cout<<"time: "<<chrono::duration_cast<chrono::milliseconds>(end-begin).count() <<endl;

but now it takes 9.4 seconds (which is half of the time Eigen with no MKL requires for the classic product). Disabling the MKL has no time effect on this timing, therefore I believe the 'rankUpdate' method does not use MKL ?!?

Last EDIT: I have found a bug in eigen header file:

Core/products/GeneralMatrixMatrixTriangular_BLAS.h

at line 55. There was a misplaced parenthesis. I changed this:

if ( lhs==rhs && ((UpLo&(Lower|Upper)==UpLo)) ) { \

to this:

if ( lhs==rhs && ((UpLo&(Lower|Upper))==UpLo) ) { \

Now, my C++ version and Matlab have the same execution speed (of ~4 seconds on my system).

To really an answer since you already figured out the issues, but some comments:

  1. The issue Core/products/GeneralMatrixMatrixTriangular_BLAS.h was already fixed in the devel branch, but it turns out it has never been brackported to the 3.3 branch.

  2. The issue is now fixed in the 3.3 branch. The fix will be part of 3.3.6.

  3. A speedup factor x2 between built-in Eigen and MKL in single thread mode does not make sense. Make sure to enable all features your CPU support by compiling with -march=native in addition to -O3 -DNDEBUG . On my Haswell 2.6GHz I get 3.4s vs 3s.

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM