Eigen + MKL or OpenBLAS slower than NumPy/SciPy + OpenBLAS

Question

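I'm just starting out with C++ and want to work with matrices and speed things up in general. I previously used Python + NumPy + OpenBLAS, and I assumed C++ + Eigen + MKL would be faster, or at least not slower.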

My C++ code:

#define EIGEN_USE_MKL_ALL
#include <iostream>
#include <Eigen/Dense>
#include <Eigen/LU>
#include <chrono>

using namespace std;
using namespace Eigen;

int main()
{
    int n = Eigen::nbThreads( );
    cout << "#Threads: " << n << endl;

    uint16_t size = 4000;
    MatrixXd a = MatrixXd::Random(size,size);

    clock_t start = clock ();
    PartialPivLU<MatrixXd> lu = PartialPivLU<MatrixXd>(a);

    float timeElapsed = double( clock() - start ) / CLOCKS_PER_SEC; 
    cout << "Elasped time is " << timeElapsed << " seconds." << endl ;
}

My Python code:

import numpy as np
from time import time
from scipy import linalg as la

size = 4000

A = np.random.random((size, size))

t = time()
LU, piv = la.lu_factor(A)
print(time()-t)

My timings:

C++     2.4s
Python  1.2s

Why is C++ slower than Python?

I am compiling the C++ code with:

g++ main.cpp -o main -lopenblas -O3 -fopenmp  -DMKL_LP64 -I/usr/local/include/mkl/include

MKL is definitely working: if I disable it, the running time is around 13 seconds.

I also tried C++ + OpenBLAS, which gives me around 2.4 seconds as well.

Any ideas why C++ and Eigen are slower than NumPy/SciPy?

Answer 1

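The timing is simply wrong. This is the typical symptom of confusing CPU time with wall-clock time: clock() measures processor time used by the process, which on POSIX systems includes the time spent on all of its threads, so a multithreaded BLAS call appears slower than it really is. When I use system_clock from the <chrono> header, it "magically" becomes faster.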

#define EIGEN_USE_MKL_ALL
#include <iostream>
#include <Eigen/Dense>
#include <Eigen/LU>
#include <chrono>

int main()
{
    int const n = Eigen::nbThreads( );
    std::cout << "#Threads: " << n << std::endl;

    int const size = 4000;
    Eigen::MatrixXd a = Eigen::MatrixXd::Random(size,size);

    auto start = std::chrono::system_clock::now();

    Eigen::PartialPivLU<Eigen::MatrixXd> lu(a);

    auto stop = std::chrono::system_clock::now();

    std::cout << "Elasped time is "
              << std::chrono::duration<double>{stop - start}.count()
              << " seconds." << std::endl;
}

I compile with

icc -O3 -mkl -std=c++11 -DNDEBUG -I/usr/include/eigen3/ test.cpp

and get the output

#Threads: 1
Elasped time is 0.295782 seconds.

Your Python version reports 0.399146080017 on my machine.

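Alternatively, to obtain comparable timings, you could use time.clock() (CPU time) in Python instead of time.time() (wall-clock time). Note that time.clock() was removed in Python 3.8; time.process_time() is the current way to read CPU time.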

Answer 2

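This is not a fair comparison. The Python routine operates in float precision, while the C++ code has to crunch doubles. That exactly doubles the computation time.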

>>> type(np.random.random_sample())
<type 'float'>

You should compare against MatrixXf rather than MatrixXd, and your MKL code should then be just as fast.
