Eigen + MKL slower than Matlab for matrix multiplication

Question

I am doing a lot of matrix multiplications in a C++ program, using Eigen (3.3.5) linked with Intel's MKL (2018.3.222). I use the sequential version of MKL, and OpenMP is disabled. The problem is that it is slower than Matlab.

Some example code:

#define NDEBUG
#define EIGEN_USE_MKL_ALL

#include <iostream>
#include <chrono>
#include <Eigen/Core>

using namespace Eigen;
using namespace std;

int main(){
    MatrixXd jac = 100*MatrixXd::Random(10*1228, 2850);
    MatrixXd res = MatrixXd::Zero(2850, 2850);

    for (int i=0; i<10; i++){
        auto begin = chrono::high_resolution_clock::now();
        res.noalias() = jac.transpose()*jac;
        auto end = chrono::high_resolution_clock::now();

        cout<<"time: "<<chrono::duration_cast<chrono::milliseconds>(end-begin).count() <<endl;
    }

    return 0;
}

It reports about 8 seconds per product on average. I compiled with -O3 and no debug symbols, using g++ 6.4 on Ubuntu 16.04.

The Matlab code:

m=100*(-1+2*rand(10*1228, 2850));
res = zeros(2850, 2850);
tic; res=m'*m; toc

It reports ~4 seconds, which is twice as fast. I used Matlab R2017a on the same system with maxNumCompThreads(1). Matlab uses MKL 11.3.

Without MKL, using only Eigen, it takes about 18 seconds. What can I do to bring the C++ running time down to the same value as Matlab's? Thank you.

Later Edit: As @Qubit suggested, Matlab recognises that I am multiplying a matrix by its own transpose and does some 'hidden' optimization. When I multiplied two different matrices in Matlab, the time went up to those same 8 seconds. So the problem now becomes: how can I tell Eigen that this matrix product is 'special' and could be optimized further?

Later Edit 2: I tried doing it like this:
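To illustrate why a product of a matrix with its own transpose can be roughly twice as cheap, here is a minimal sketch in plain C++ (no Eigen, no MKL). The result of A^T * A is symmetric, so only one triangle has to be computed and the other can be mirrored; BLAS calls this kind of operation a symmetric rank-k update (SYRK), as opposed to a general product (GEMM). The `Matrix` type and the function names `columnDot`, `gramFull`, and `gramSymmetric` are hypothetical helpers made up for this example, and the naive loops are only meant to show the operation count, not how an optimized library implements it.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Minimal row-major dense matrix (hypothetical helper type, not Eigen's).
struct Matrix {
    std::size_t rows, cols;
    std::vector<double> data;
    Matrix(std::size_t r, std::size_t c) : rows(r), cols(c), data(r * c, 0.0) {}
    double& operator()(std::size_t i, std::size_t j) {
        assert(i < rows && j < cols);
        return data[i * cols + j];
    }
    double operator()(std::size_t i, std::size_t j) const {
        assert(i < rows && j < cols);
        return data[i * cols + j];
    }
};

// Dot product of columns i and j of a, i.e. entry (i, j) of a^T * a.
double columnDot(const Matrix& a, std::size_t i, std::size_t j) {
    double sum = 0.0;
    for (std::size_t k = 0; k < a.rows; ++k) sum += a(k, i) * a(k, j);
    return sum;
}

// General product: all n*n entries are computed independently (GEMM-like).
Matrix gramFull(const Matrix& a) {
    Matrix c(a.cols, a.cols);
    for (std::size_t i = 0; i < a.cols; ++i)
        for (std::size_t j = 0; j < a.cols; ++j)
            c(i, j) = columnDot(a, i, j);
    return c;
}

// Symmetric product: only the lower triangle is computed (SYRK-like),
// then each entry is mirrored into the upper triangle.
Matrix gramSymmetric(const Matrix& a) {
    Matrix c(a.cols, a.cols);
    for (std::size_t i = 0; i < a.cols; ++i)
        for (std::size_t j = 0; j <= i; ++j) {
            c(i, j) = columnDot(a, i, j);
            c(j, i) = c(i, j);
        }
    return c;
}
```

For an n-column input, `gramFull` evaluates n*n column dot products while `gramSymmetric` evaluates n*(n+1)/2, which is about half for n = 2850.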

MatrixXd jac = 100*MatrixXd::Random(10*1228, 2850);
MatrixXd res = MatrixXd::Zero(2850, 2850);

auto begin = chrono::high_resolution_clock::now();
res.selfadjointView<Lower>().rankUpdate(jac.transpose(), 1);
res.triangularView<Upper>() = res.transpose();
auto end = chrono::high_resolution_clock::now();

MatrixXd oldSchool = jac.transpose()*jac;
if (oldSchool.isApprox(res)){
    cout<<"same result!"<<endl;
}
cout<<"time: "<<chrono::duration_cast<chrono::milliseconds>(end-begin).count() <<endl;

but now it takes 9.4 seconds (about half the time Eigen without MKL needs for the plain product). Disabling MKL has no effect on this timing, so I believe the 'rankUpdate' method does not use MKL?

Last EDIT: I have found a bug in an Eigen header file:

Core/products/GeneralMatrixMatrixTriangular_BLAS.h

at line 55. There was a misplaced parenthesis. I changed this:

if ( lhs==rhs && ((UpLo&(Lower|Upper)==UpLo)) ) { \

to this:

if ( lhs==rhs && ((UpLo&(Lower|Upper))==UpLo) ) { \

Now my C++ version and Matlab have the same execution speed (~4 seconds on my system).
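The effect of the misplaced parenthesis can be checked in isolation. In C++, `==` binds more tightly than `&`, so the original expression `UpLo&(Lower|Upper)==UpLo` is parsed as `UpLo & ((Lower|Upper)==UpLo)`. The sketch below assumes the flag values 0x1 for `Lower` and 0x2 for `Upper`, which I believe mirror Eigen's; the function names `originalCheck` and `fixedCheck` are made up for this example.

```cpp
// Bit flags assumed to mirror Eigen's Lower and Upper values (0x1 and 0x2).
constexpr unsigned Lower = 0x1;
constexpr unsigned Upper = 0x2;

// The original condition, written with the parentheses the compiler
// effectively inserts: the == comparison runs first, then the & with upLo.
constexpr bool originalCheck(unsigned upLo) {
    return (upLo & static_cast<unsigned>((Lower | Upper) == upLo)) != 0;
}

// The corrected condition: mask with (Lower | Upper) first, then compare.
constexpr bool fixedCheck(unsigned upLo) {
    return (upLo & (Lower | Upper)) == upLo;
}

// With these flag values:
//   originalCheck(Lower) and originalCheck(Upper) are both false,
//   fixedCheck(Lower) and fixedCheck(Upper) are both true.
```

Under these assumptions the original test fails for both `Lower` and `Upper`, so the branch it guards is skipped even when `lhs==rhs`, which is consistent with the speedup observed after the fix.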

Answer 1

Not really an answer, since you already figured out the issue, but some comments:

The issue in Core/products/GeneralMatrixMatrixTriangular_BLAS.h was already fixed in the devel branch, but it turns out the fix had never been backported to the 3.3 branch.
The issue is now fixed in the 3.3 branch. The fix will be part of 3.3.6.
A factor-of-2 speedup between built-in Eigen and MKL in single-threaded mode does not make sense. Make sure to enable all the features your CPU supports by compiling with -march=native in addition to -O3 -DNDEBUG. On my Haswell 2.6GHz I get 3.4s vs 3s.
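One way to check whether -march=native actually enabled wider vector instructions is to inspect the macros that GCC and Clang predefine for each instruction set. The following is a small sketch, assuming an x86 target and one of those compilers; the function names `enabledSimd` and `fmaEnabled` are hypothetical.

```cpp
// Reports the widest x86 SIMD extension the compiler was allowed to use,
// based on macros predefined by GCC and Clang for the active target flags.
constexpr const char* enabledSimd() {
#if defined(__AVX512F__)
    return "AVX-512F";
#elif defined(__AVX2__)
    return "AVX2";
#elif defined(__AVX__)
    return "AVX";
#elif defined(__SSE2__)
    return "SSE2";
#else
    return "none detected";
#endif
}

// Reports whether fused multiply-add instructions were enabled.
constexpr bool fmaEnabled() {
#if defined(__FMA__)
    return true;
#else
    return false;
#endif
}
```

Printing `enabledSimd()` and `fmaEnabled()` from a program built with and without -march=native shows whether the flag changed the instruction sets available to the compiler on a given machine.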

Answer 1, 2018-08-16 16:26:44

对于矩阵乘法，Eigen + MKL比Matlab慢

问题描述

1 个解决方案

解决方案1 1 2018-08-16 16:26:44

解决方案1
1 2018-08-16 16:26:44