简体   繁体   中英

Why is my Python NumPy code faster than C++?

Why is this Python NumPy code,

import numpy as np
import time

k_max = 40000
N = 10000

data = np.zeros((2,N))
coefs = np.zeros((k_max,2),dtype=float)

t1 = time.time()
for k in xrange(1,k_max+1):
    cos_k = np.cos(k*data[0,:])
    sin_k = np.sin(k*data[0,:])
    coefs[k-1,0] = (data[1,-1]-data[1,0]) + np.sum(data[1,:-1]*(cos_k[:-1] - cos_k[1:]))
    coefs[k-1,1] = np.sum(data[1,:-1]*(sin_k[:-1] - sin_k[1:]))
t2 = time.time()

print('Time:')
print(t2-t1)

I found this question interesting, because every time I encountered similar topic about the speed of NumPy (compared to C/C++) there was always answers like "it's a thin wrapper, its core is written in C, so it's fast", but this doesn't explain why C should be slower than C with additional layer (even a thin one).

The answer is: your C++ code is not slower than your Python code when properly compiled .

I've done some benchmarks, and at first it seemed that NumPy is surprisingly faster. But I forgot about optimizing the compilation with GCC .

I've computed everything again and also compared results with a pure C version of your code. I am using GCC version 4.9.2, and Python 2.7.9 (compiled from the source with the same GCC). To compile your C++ code I used g++ -O3 main.cpp -o main , to compile my C code I used gcc -O3 main.c -lm -o main . In all examples I filled data variables with some numbers (0.1, 0.4), as it changes results. I also changed np.arrays to use doubles ( dtype=np.float64 ), because there are doubles in C++ example. My pure C version of your code (it's similar):

#include <math.h>
#include <stdio.h>
#include <time.h>

const int k_max = 100000;
const int N = 10000;

int main(void)
{
    clock_t t_start, t_end;
    double data1[N], data2[N], coefs1[k_max], coefs2[k_max], seconds;
    int z;
    for( z = 0; z < N; z++ )
    {
        data1[z] = 0.1;
        data2[z] = 0.4;
    }

    int i, j;
    t_start = clock();
    for( i = 0; i < k_max; i++ )
    {
        for( j = 0; j < N-1; j++ )
        {
            coefs1[i] += data2[j] * (cos((i+1) * data1[j]) - cos((i+1) * data1[j+1]));
            coefs2[i] += data2[j] * (sin((i+1) * data1[j]) - sin((i+1) * data1[j+1]));
        }
    }
    t_end = clock();

    seconds = (double)(t_end - t_start) / CLOCKS_PER_SEC;
    printf("Time: %f s\n", seconds);
    return coefs1[0];
}

For k_max = 100000, N = 10000 results where following:

  • Python 70.284362 s
  • C++ 69.133199 s
  • C 61.638186 s

Python and C++ have basically the same time, but note that there is a Python loop of length k_max, which should be much slower compared to C/C++ one. And it is.

For k_max = 1000000, N = 1000 we have:

  • Python 115.42766 s
  • C++ 70.781380 s

For k_max = 1000000, N = 100 :

  • Python 52.86826 s
  • C++ 7.050597 s

So the difference increases with fraction k_max/N , but python is not faster even for N much bigger than k_max , eg k_max = 100, N = 100000 :

  • Python 0.651587 s
  • C++ 0.568518 s

Obviously, the main speed difference between C/C++ and Python is in the for loop. But I wanted to find out the difference between simple operations on arrays in NumPy and in C. Advantages of using NumPy in your code consists of: 1. multiplying the whole array by a number, 2. calculating sin/cos of the whole array, 3. summing all elements of the array, instead of doing those operations on every single item separately. So I prepared two scripts to compare only these operations.

Python script:

import numpy as np
from time import time

N = 10000
x_len = 100000

def main():
    x = np.ones(x_len, dtype=np.float64) * 1.2345

    start = time()
    for i in xrange(N):
        y1 = np.cos(x, dtype=np.float64)
    end = time()
    print('cos: {} s'.format(end-start))

    start = time()
    for i in xrange(N):
        y2 = x * 7.9463
    end = time()
    print('multi: {} s'.format(end-start))

    start = time()
    for i in xrange(N):
        res = np.sum(x, dtype=np.float64)
    end = time()
    print('sum: {} s'.format(end-start))

    return y1, y2, res

if __name__ == '__main__':
    main()

# results
# cos: 22.7199969292 s
# multi: 0.841291189194 s
# sum: 1.15971088409 s

C script:

#include <math.h>
#include <stdio.h>
#include <time.h>

const int N = 10000;
const int x_len = 100000;

int main()
{
    clock_t t_start, t_end;
    double x[x_len], y1[x_len], y2[x_len], res, time;
    int i, j;
    for( i = 0; i < x_len; i++ )
    {
        x[i] = 1.2345;
    }

    t_start = clock();
    for( j = 0; j < N; j++ )
    {
        for( i = 0; i < x_len; i++ )
        {
            y1[i] = cos(x[i]);
        }
    }
    t_end = clock();
    time = (double)(t_end - t_start) / CLOCKS_PER_SEC;
    printf("cos: %f s\n", time);

    t_start = clock();
    for( j = 0; j < N; j++ )
    {
        for( i = 0; i < x_len; i++ )
        {
            y2[i] = x[i] * 7.9463;
        }
    }
    t_end = clock();
    time = (double)(t_end - t_start) / CLOCKS_PER_SEC;
    printf("multi: %f s\n", time);

    t_start = clock();
    for( j = 0; j < N; j++ )
    {
        res = 0.0;
        for( i = 0; i < x_len; i++ )
        {
            res += x[i];
        }
    }
    t_end = clock();
    time = (double)(t_end - t_start) / CLOCKS_PER_SEC;
    printf("sum: %f s\n", time);

    return y1[0], y2[0], res;
}

// results
// cos: 20.910590 s
// multi: 0.633281 s
// sum: 1.153001 s

Python results:

  • cos: 22.7199969292 s
  • multi: 0.841291189194 s
  • sum: 1.15971088409 s

C results:

  • cos: 20.910590 s
  • multi: 0.633281 s
  • sum: 1.153001 s

As you can see NumPy is incredibly fast, but always a bit slower than pure C.

On my computer, your (current) Python code runs in 14.82 seconds (yes, my computer's quite slow).

I rewrote your C++ code to something I'd consider halfway reasonable (basically, I almost ignored your C++ code and just rewrote your Python into C++. That gave me this:

#include <cstdio>
#include <iostream>
#include <cmath>
#include <chrono>
#include <vector>
#include <assert.h>

const unsigned int k_max = 40000;
const unsigned int N = 10000;

template <class T>
class matrix2 {
    std::vector<T> data;
    size_t cols;
    size_t rows;
public:
    matrix2(size_t y, size_t x) : cols(x), rows(y), data(x*y) {}
    T &operator()(size_t y, size_t x) {
        assert(x <= cols);
        assert(y <= rows);
        return data[y*cols + x];
    }

    T operator()(size_t y, size_t x) const {
        assert(x <= cols);
        assert(y <= rows);
        return data[y*cols + x];
    }
};

int main() {
    matrix2<double> data(N, 2);
    matrix2<double> coeffs(k_max, 2);

    using namespace std::chrono;

    auto start = high_resolution_clock::now();

    for (int k = 0; k < k_max; k++) {
        for (int j = 0; j < N - 1; j++) {
            coeffs(k, 0) += data(j, 1) * (cos((k + 1)*data(j, 0)) - cos((k + 1)*data(j+1, 0)));
            coeffs(k, 1) += data(j, 1) * (sin((k + 1)*data(j, 0)) - sin((k + 1)*data(j+1, 0)));
        }
    }

    auto end = high_resolution_clock::now();
    std::cout << duration_cast<milliseconds>(end - start).count() << " ms\n";
}

This ran in about 14.4 seconds, so it's a slight improvement over the Python version--but given that the Python is mostly a pretty thin wrapper around some C code, getting only a slight improvement is pretty much what we should expect.

The next obvious step would be to use multiple cores. To do that in C++, we can add this line:

#pragma omp parallel for

...before the outer for loop:

#pragma omp parallel for
for (int k = 0; k < k_max; k++) {
    for (int j = 0; j < N - 1; j++) {
        coeffs(k, 0) += data(j, 1) * (cos((k + 1)*data(j, 0)) - cos((k + 1)*data(j+1, 0)));
        coeffs(k, 1) += data(j, 1) * (sin((k + 1)*data(j, 0)) - sin((k + 1)*data(j+1, 0)));
    }
}

With -openmp added to the compiler's command line (though the exact flag depends on the compiler you're using, of course), this ran in about 4.8 seconds. If you have more than 4 cores, you can probably expect a larger improvement than that though (conversely, if you have fewer than 4 cores, expect a smaller improvement--but nowadays, more than 4 is a lot more common that fewer).

I tried to understand your Python code and reproduce it in C++. I found that you didn't represent correctly the for-loops in order to do the correct calculations of the coeffs , hence should switch your for-loops . If this is the case, you should have the following:

#include <iostream>
#include <cmath>
#include <time.h>

const int k_max = 40000;
const int N = 10000;

double cos_k, sin_k;

int main(int argc, char const *argv[])
{
    time_t start, stop;
    double data[2][N];
    double coefs[k_max][2];

    time(&start);

    for(int i=0; i<k_max; ++i)
    {
        for(int j=0; j<N; ++j)
        {
            coefs[i][0] += data[1][j-1] * (cos((i+1) * data[0][j-1]) - cos((i+1) * data[0][j]));
            coefs[i][1] += data[1][j-1] * (sin((i+1) * data[0][j-1]) - sin((i+1) * data[0][j]));
        }
    }
    // End of main loop

    time(&stop);

    // Speed result
    double diff = difftime(stop, start);
    std::cout << "Time: " << diff << " seconds" << std::endl;

    return 0;
}

Switching the for-loops gives me: 3 seconds for C++ code, optimized with -O3 , while Python code runs at 7.816 seconds.

The Python code can't be faster than properly-coded C++ code since Numpy is coded in C, which is often slower than C++ since C++ can do more optimizations. They'll only be around each other with Python running somewhere between the same time as C++ to about twice C++ when doing the majority of your computation in large computations that Python pushes off to compiled binaries to calculate. Most anything beyond large matrix multiplication, addition, scalar on matrix multiplication, etc. will perform much worse in Python. For example, look at the Benchmark Game where people submit solutions to various algorithms in various languages, and the website keeps track of the fastest submissions for each (algorithm, language) pair. You can even view the source code for each submission. For most test cases, Python is 2-15 times slower than C++. That makes sense too if you do anything other than simple math operations - anything with linked lists, binary search trees, procedural code, etc. The interpreted nature of Python combined with it storing metadata for each object (even int , double , float , etc.) significantly bogs things down in a way that no Python programmer can fix.

I am actually surprised that no one mentioned Linear Algebra libraries like BLAS LAPACK MKL and all...

Numpy is using complex Linear Algebra libraries ! Essentially, Numpy is most of the time not built on pure c/cpp/fortran code... it is actually built on complex libraries that take advantage of the most performant algorithms and ideas to optimise the code. These complex libraries are hardly matched by naive implementation of classic linear algebra computations. The simplest first example of improvement is the blocking trick.

I took the following image from the CSE lab of ETH, where they compare matrix vector multiplication for different implementation. The y-axis represents the intensity of computations (in GFLOPs); long story short, it is how fast the computations are done. The x-axis is the dimension of the matrix.

在此处输入图像描述

C and C++ are fast languages, but actually if you want to mimic the speed of these libraries, you might have to go one step deeper and use either Fortran or intrinsics instructions (that are perhaps the closest to assembly code you can do in C++).

Consider the question Benchmarking (python vs. c++ using BLAS) and (numpy) , where the very good answer from @Jfs, and we observe: "There is no difference between C++ and numpy on my machine."

Some more reference:

Why is a naïve C++ matrix multiplication 100 times slower than BLAS?

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM