
Why is my Numpy test code 2X slower than in Matlab

I've been developing a Fresnel-coefficient-based reflectivity solver in Python, and I've hit a roadblock: the Python + Numpy version is about 2X slower than the Matlab one. I've distilled the problem into a simple example showing the operation performed in each case:

Python Code for test case:

import numpy as np
import time

def compare_fn(i):
    a = np.random.rand(400)
    vec = np.random.rand(400)
    t = time.time()
    for j in xrange(i):  # Python 2; use range() on Python 3
        a = (2.3 + a * np.exp(2j*vec))/(1 + (2.3 * a * np.exp(2j*vec)))
    print (time.time()-t)
    return a

a = compare_fn(200000)

Output: 10.7989997864

Equivalent Matlab code:

function a = compare_fn(i)

    a = rand(1, 400);
    vec = rand(1, 400);
    tic
    for m = 1:i
        a = (2.3 + a .* exp(2j*vec))./(1 + (2.3 * a .* exp(2j*vec)));
    end
    toc

a = compare_fn(200000);
Elapsed time is 5.644673 seconds.

I'm stumped by this. I already have MKL installed (Anaconda Academic License). I would greatly appreciate any help in identifying any issue with my example, and in achieving equivalent, if not better, performance with Numpy.

In general, I cannot parallelize the loop: solving the Fresnel coefficients for a multilayer involves a recursive calculation, which can only be expressed as a loop like the one above.

The following is similar to unutbu's deleted answer, and for your sample input runs 3X faster on my system. It would probably also run faster if you implemented it like this in Matlab, but that's a different story. To be able to use IPython's %timeit functionality, I have rewritten your original function as:

def fn(a, vec, i):
    for j in xrange(i):
        a = (2.3 + a * np.exp(2j*vec))/(1 + (2.3 * a * np.exp(2j*vec)))
    return a

And I have optimized it by hoisting the exponential calculation out of the loop, since vec never changes between iterations:

def fn_bis(a, vec, n):
    exp_vec = np.exp(2j*vec)
    for j in xrange(n):
        a = (2.3 + a * exp_vec) / (1 + 2.3 * a * exp_vec)
    return a

Taking both approaches for a test ride:

In [2]: a = np.random.rand(400)

In [3]: vec = np.random.rand(400)

In [9]: np.allclose(fn(a, vec, 100), fn_bis(a, vec, 100))
Out[9]: True

In [10]: %timeit fn(a, vec, 100)
100 loops, best of 3: 8.43 ms per loop

In [11]: %timeit fn_bis(a, vec, 100)
100 loops, best of 3: 2.57 ms per loop

In [12]: %timeit fn(a, vec, 200000)
1 loops, best of 3: 16.9 s per loop

In [13]: %timeit fn_bis(a, vec, 200000)
1 loops, best of 3: 5.25 s per loop
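A side note on this particular benchmark: since the update `a = (2.3 + a*exp_vec)/(1 + 2.3*a*exp_vec)` is a Möbius (linear-fractional) transformation whose coefficients never change between iterations, all n iterations can in principle be collapsed into a single 2x2 complex matrix power per element. This is only a sketch (the function name is mine, and it assumes the constant 2.3 and vec really are fixed across iterations; in the actual multilayer recursion the coefficients presumably change every step, so this shortcut would not apply there):

```python
import numpy as np

def fn_mobius(a, vec, n):
    """Collapse n iterations of a -> (2.3 + a*e)/(1 + 2.3*a*e) into one matrix power."""
    e = np.exp(2j * vec)
    r = 2.3
    # Each update is the Moebius map with per-element matrix [[e, r], [r*e, 1]];
    # composing n identical maps corresponds to taking the n-th matrix power.
    M = np.empty(vec.shape + (2, 2), dtype=np.complex128)
    M[..., 0, 0] = e
    M[..., 0, 1] = r
    M[..., 1, 0] = r * e
    M[..., 1, 1] = 1.0
    # Normalize to unit determinant (det M = e*(1 - r**2)) so that repeated
    # squaring inside matrix_power stays in floating-point range for large n.
    M /= np.sqrt(e * (1 - r * r))[..., None, None]
    # Matrix power per element; a Python loop keeps this compatible with older
    # NumPy (newer NumPy can apply matrix_power to the whole stack at once).
    Mn = np.stack([np.linalg.matrix_power(M[k], n) for k in range(vec.shape[0])])
    return (Mn[..., 0, 0] * a + Mn[..., 0, 1]) / (Mn[..., 1, 0] * a + Mn[..., 1, 1])
```

The matrix power costs O(log n) squarings instead of n vectorized passes, so for large iteration counts it is asymptotically much cheaper, but again only for this fixed-coefficient benchmark.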

I've been doing a lot of experimenting to try and determine the source of the speed difference between Matlab and Python/Numpy for the example in my original question. Some of the key findings have been:

  1. Matlab now has a JIT compiler that provides a significant benefit in loop-heavy code. Turning it off slows the Matlab code down by about 2X, making it similar in speed to the native Python + Numpy code.

    feature accel off

    a = compare_fn(200000);

    Elapsed time is 9.098062 seconds.

  2. I then began exploring options for optimizing my example function using Numba and Cython to see how much better I could do. The key finding for me was that Numba JIT compilation of an explicit elementwise loop was faster than the equivalent vectorized math operations on Numpy arrays. I don't fully understand why this is the case, but I have included my sample code and timings for the tests below. I also experimented with Cython (I'm no expert); although it was also quicker than plain Numpy, Numba was still 2X faster than Cython, so I ended up sticking with Numba for the tests.

Here is the code for 3 equivalent functions. First one is a Numba optimized function with an explicit loop to perform elementwise calculations. Second function is a Python+Numpy function relying on Numpy vectorization to perform calculations. The third function tries to use Numba to optimize the vectorized Numpy code (and fails to improve as you can see in the results). Lastly, I've included the Cython code, though I only tested it for one case.

import numpy as np
import numba as nb

@nb.jit(nb.complex128[:](nb.int64, nb.int64))  # int64: the test sizes below (up to 1000000) overflow int16
def compare_fn_jit(i, j):
    a = np.asarray(np.random.rand(j), dtype=np.complex128)
    vec = np.random.rand(j)
    exp_term = np.exp(2j*vec)

    for k in xrange(i):
        for l in xrange(j):
            a[l] = (2.3 + a[l] * exp_term[l])/(1 + (2.3 * a[l] * exp_term[l]))
    return a

def compare_fn(i, j):
    a = np.asarray(np.random.rand(j), dtype=np.complex128)
    vec = np.random.rand(j)
    exp_term = np.exp(2j*vec)
    for k in xrange(i):
        a = (2.3 + a * exp_term)/(1 + (2.3 * a * exp_term))
    return a

compare_fn_jit2 = nb.jit(nb.complex128[:](nb.int64, nb.int64))(compare_fn)


# Cython version (goes in a .pyx file and is compiled separately)
import numpy as np
cimport numpy as np
cimport cython

@cython.boundscheck(False)
def compare_fn_cython(int i, int j):
    cdef int k, l
    cdef np.ndarray[np.complex128_t, ndim=1] a, vec, exp_term
    a = np.asarray(np.random.rand(j), dtype=np.complex128)
    vec = np.asarray(np.random.rand(j), dtype=np.complex128)
    exp_term = np.exp(2j*vec)

    for k in xrange(i):
        for l in xrange(j):
            a[l] = (2.3 + a[l] * exp_term[l])/(1 + (2.3 * a[l] * exp_term[l]))
    return a

Timing Results:

i. Timing for a single outer loop - Demonstrates efficiency of vectorized calculations

%timeit -n 1 -r 10 compare_fn_jit(1,1000000)
1 loops, best of 10: 352 ms per loop

%timeit -n 1 -r 10 compare_fn(1,1000000)
1 loops, best of 10: 498 ms per loop

%timeit -n 1 -r 10 compare_fn_jit2(1,1000000)
1 loops, best of 10: 497 ms per loop

%timeit -n 1 -r 10 compare_fn_cython(1,1000000)
1 loops, best of 10: 424 ms per loop

ii. Timing in extreme case of large loops with calculations on short arrays (expect Numpy+Python to perform poorly)

%timeit -n 1 -r 5 compare_fn_jit(1000000,40)
1 loops, best of 5: 1.44 s per loop

%timeit -n 1 -r 5 compare_fn(1000000,40)
1 loops, best of 5: 28.2 s per loop

%timeit -n 1 -r 5 compare_fn_jit2(1000000,40)
1 loops, best of 5: 29 s per loop

iii. Test for somewhere mid-way between the two cases above

%timeit -n 1 -r 5 compare_fn_jit(100000,400)
1 loops, best of 5: 1.4 s per loop

%timeit -n 1 -r 5 compare_fn(100000,400)
1 loops, best of 5: 5.26 s per loop

%timeit -n 1 -r 5 compare_fn_jit2(100000,400)
1 loops, best of 5: 5.34 s per loop

As you can see, using Numba improves efficiency by a factor ranging from 1.5X to 30X depending on the case. I am truly impressed with how efficient it is, and how easy it is to use and implement when compared against Cython.

I don't know if numpypy is far enough along yet for what you're doing, but you might try it.

http://buildbot.pypy.org/numpy-status/latest.html
