Performance: Matlab vs C++ Matrix-Vector Multiplication

Question

Preamble

Some time ago I asked a question about the performance of Matlab vs Python (Performance: Matlab vs Python). I was surprised that Matlab was faster than Python, especially in meshgrid. In the discussion of that question, it was pointed out to me that I should use a wrapper in Python to call my C++ code, since the C++ code is also available to me. I have the same code in C++, Matlab and Python.

While doing that, I was surprised once again to find that Matlab is faster than C++ in matrix assembly and computation. I have a slightly larger code, from which I am investigating a segment that performs a matrix-vector multiplication. The larger code performs such multiplications in multiple places. Overall the C++ code is much faster than Matlab (because function calls in Matlab have overhead, etc.), but Matlab seems to outperform C++ in the matrix-vector multiplication (code snippets at the bottom).

Results

The table below compares the time it takes to assemble the kernel matrix with the time it takes to multiply the matrix by the vector. The results are for a matrix of size NxN, where N varies from 10,000 to 40,000, which is not that large. The interesting thing is that Matlab's advantage over C++ grows as N gets larger. Matlab is 3.8 to 5.8 times faster in total time. Moreover, it is faster in both matrix assembly and computation.

 ___________________________________________
|N=10,000   Assembly    Computation  Total  |
|MATLAB     0.3387      0.031        0.3697 |
|C++        1.15        0.24         1.4    |
|Times faster                        3.8    |
 ___________________________________________ 
|N=20,000   Assembly    Computation  Total  |
|MATLAB     1.089       0.0977       1.187  |
|C++        5.1         1.03         6.13   |
|Times faster                        5.2    |
 ___________________________________________
|N=40,000   Assembly    Computation  Total  |
|MATLAB     4.31        0.348        4.655  |
|C++        23.25       3.91         27.16  |
|Times faster                        5.8    |
 -------------------------------------------

Question

Is there a faster way of doing this in C++? Am I missing something? I understand that the C++ code uses for loops, but my understanding is that Matlab will also be doing something similar in meshgrid.

Code Snippets

Matlab Code:

%% GET INPUT DATA FROM DATA FILES ------------------------------------------- %
% Read data from input file
Data       = load('Input/input.txt');
location   = Data(:,1:2);           
charges    = Data(:,3:end);         
N          = length(location);      
m          = size(charges,2);       

%% EXACT MATRIX VECTOR PRODUCT ---------------------------------------------- %
kex1=ex1; 
tic
Q = kex1.kernel_2D(location , location);
fprintf('\n Assembly time: %f ', toc);

tic
potential_exact = Q * charges;
fprintf('\n Computation time: %f \n', toc);

Class (using meshgrid):

classdef ex1
    methods 
        function [kernel] = kernel_2D(obj, x,y) 
            [i1,j1] = meshgrid(y(:,1),x(:,1));
            [i2,j2] = meshgrid(y(:,2),x(:,2));
            kernel = sqrt( (i1 - j1) .^ 2 + (i2 - j2) .^2 );
        end
    end       
end

C++ Code:

EDIT

Compiled using a makefile with the following flags:

CC=g++ 
CFLAGS=-c -fopenmp -w -Wall -DNDEBUG -O3 -march=native -ffast-math -ffinite-math-only -I header/ -I /usr/include 
LDFLAGS= -g -fopenmp  
LIB_PATH= 

SOURCESTEXT= src/read_Location_Charges.cpp 
SOURCESF=examples/matvec.cpp
OBJECTSF= $(SOURCESF:.cpp=.o) $(SOURCESTEXT:.cpp=.o)
EXECUTABLEF=./exec/mykernel
mykernel: $(SOURCESF) $(SOURCESTEXT) $(EXECUTABLEF)
$(EXECUTABLEF): $(OBJECTSF)
    $(CC) $(LDFLAGS) $(KERNEL) $(INDEX) $(OBJECTSF) -o $@ $(LIB_PATH)
.cpp.o:
    $(CC) $(CFLAGS) $(KERNEL) $(INDEX) $< -o $@

#include "environment.hpp"
using namespace std;
using namespace Eigen;

class ex1 
{
public:
    void kernel_2D(const unsigned long M, double*& x, const unsigned long N,  double*&  y, MatrixXd& kernel)    {   
        kernel  =   MatrixXd::Zero(M,N);
        for(unsigned long i=0;i<M;++i)  {
            for(unsigned long j=0;j<N;++j)  {
                        double X =   (x[0*N+i] - y[0*N+j]) ;
                        double Y =   (x[1*N+i] - y[1*N+j]) ;
                        kernel(i,j) = sqrt((X*X) + (Y*Y));
            }
        }
    }
};

int main()
{
    /* Input ----------------------------------------------------------------------------- */
    unsigned long N = 40000;          unsigned m=1;                   
    double* charges;                  double* location;
    charges =   new double[N * m]();  location =    new double[N * 2]();
    clock_t start;                    clock_t end;
    double exactAssemblyTime;         double exactComputationTime;

    read_Location_Charges ("input/test_input.txt", N, location, m, charges);

    MatrixXd charges_           =   Map<MatrixXd>(charges, N, m);
    MatrixXd Q;
    ex1 Kex1;

    /* Process ------------------------------------------------------------------------ */
    // Matrix assembly
    start = clock();
        Kex1.kernel_2D(N, location, N, location, Q);
    end = clock();
    exactAssemblyTime = double(end-start)/double(CLOCKS_PER_SEC);

    //Computation
    start = clock();
        MatrixXd QH = Q * charges_;
    end = clock();
    exactComputationTime = double(end-start)/double(CLOCKS_PER_SEC);

    cout << endl << "Assembly     time: " << exactAssemblyTime << endl;
    cout << endl << "Computation time: " << exactComputationTime << endl;


    // Clean up
    delete []charges;
    delete []location;
    return 0;
}

Answer 1

As said in the comments, MATLAB relies on Intel's MKL library for matrix products, which is the fastest library for this kind of operation. Nonetheless, Eigen alone should be able to deliver similar performance. To this end, make sure to use a recent Eigen (e.g. 3.4) and proper compilation flags to enable AVX/FMA, if available, and multithreading:

-O3 -DNDEBUG -march=native

Since charges_ is a vector, it is better to use a VectorXd so that Eigen knows you want a matrix-vector product and not a matrix-matrix one.

If you have Intel's MKL, you can also let Eigen use it to get exactly the same performance as MATLAB for this particular operation.

Regarding the assembly, it is better to swap the two loops to enable vectorization, then enable multithreading with OpenMP (add -fopenmp to the compiler flags) so that the outermost loop runs in parallel. Finally, you can simplify your code using Eigen:

void kernel_2D(const unsigned long M, double* x, const unsigned long N,  double*  y, MatrixXd& kernel)    {
    kernel.resize(M,N);
    auto x0 = ArrayXd::Map(x,M);
    auto x1 = ArrayXd::Map(x+M,M);
    auto y0 = ArrayXd::Map(y,N);
    auto y1 = ArrayXd::Map(y+N,N);
    #pragma omp parallel for
    for(unsigned long j=0;j<N;++j)
      kernel.col(j) = sqrt((x0-y0(j)).abs2() + (x1-y1(j)).abs2());
}
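The loop-order point can also be sketched without Eigen. The following is a minimal illustration, not the code from the question: the function name assemble_kernel and the use of std::vector are assumptions made here, and the matrix is stored column-major (the default layout of Eigen's MatrixXd). The outer loop walks columns and the inner loop walks rows, so each inner iteration writes to consecutive memory.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper (not from the original post): fills a column-major
// M x N matrix with pairwise Euclidean distances.
// x holds M points as [all first coordinates..., all second coordinates...];
// y holds N points in the same layout.
std::vector<double> assemble_kernel(const std::vector<double>& x, std::size_t M,
                                    const std::vector<double>& y, std::size_t N) {
    std::vector<double> kernel(M * N);
    // Outer loop over columns: each thread fills its own contiguous columns.
    // The pragma is ignored when the code is compiled without -fopenmp.
    #pragma omp parallel for
    for (long long jj = 0; jj < static_cast<long long>(N); ++jj) {
        const std::size_t j = static_cast<std::size_t>(jj);
        const double y0 = y[j];
        const double y1 = y[N + j];
        double* col = kernel.data() + j * M;
        // Inner loop over rows: unit-stride writes that the compiler can vectorize.
        for (std::size_t i = 0; i < M; ++i) {
            const double dx = x[i] - y0;
            const double dy = x[M + i] - y1;
            col[i] = std::sqrt(dx * dx + dy * dy);
        }
    }
    return kernel;
}
```

Entry (i, j) of the result lives at kernel[j * M + i]. Note that x is indexed with M and y with N, so the sketch also works when the two point sets differ in size.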

With multithreading you need to measure wall-clock time. Here (Haswell with 4 physical cores running at 2.6GHz) the assembly time drops to 0.36s for N=20000 and the matrix-vector product takes 0.24s, so 0.6s in total, which is faster than MATLAB even though my CPU seems to be slower than yours.
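One way to measure wall-clock time is std::chrono::steady_clock from the standard library. The sketch below is an illustration only; the helper name wall_seconds is an assumption and does not appear in the question. clock() reports processor time, which on many platforms is summed over all threads of the process and can therefore exceed the elapsed time once OpenMP is enabled.

```cpp
#include <chrono>

// Hypothetical helper (not from the original post): runs `work` once and
// returns the elapsed wall-clock time in seconds.
template <typename F>
double wall_seconds(F&& work) {
    const auto start = std::chrono::steady_clock::now();
    work();
    const auto stop = std::chrono::steady_clock::now();
    // duration<double> converts the tick count to seconds as a floating-point value.
    return std::chrono::duration<double>(stop - start).count();
}
```

In the question's main this could be used as `exactAssemblyTime = wall_seconds([&] { Kex1.kernel_2D(N, location, N, location, Q); });`. When OpenMP is already linked in, omp_get_wtime() is an alternative.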

Answer 2

You might be interested in the MATLAB Central contribution mtimesx.

mtimesx is a mex function that optimizes matrix multiplications using the BLAS library, OpenMP and other methods. In my experience, when it was originally posted it could beat stock MATLAB by 3 orders of magnitude in some cases. (Somewhat embarrassing for MathWorks, I presume.) These days MATLAB has improved its own methods (I suspect by borrowing from this) and the differences are less severe. MATLAB sometimes outperforms it.
