Why is numpy much slower than matlab on a digitize example?

Question

I am comparing the performance of numpy and Matlab. In several cases I observed that numpy is significantly slower (indexing and simple operations on arrays such as absolute value, multiplication, sum, etc.). Consider the following rather striking example involving the function digitize (which I plan to use for synchronizing timestamps):

import numpy as np
import time

scale = np.arange(1, 1e+6 + 1)    # values to locate: 1, 2, ..., 1e6
y = np.arange(1, 1e+6 + 1, 10)    # bin edges: 1, 11, 21, ...
t1 = time.time()
ind = np.digitize(scale, y)
t2 = time.time()
print('Time passed is %2.2f seconds' % (t2 - t1))

The result is:

Time passed is 55.91 seconds

Let's now try the same example in Matlab, using the equivalent function histc:

scale=[1:1e+6];
y=[1:10:1e+6];
tic
[N,bin]=histc(scale,y);
t=toc;
display(['Time passed is ',num2str(t), ' seconds'])

The result is:

Time passed is 0.10237 seconds

That's roughly 550 times faster!

As I'm learning to extend Python with C++, I implemented my own version of digitize (using boost libraries for the extension):

import analysis  # my C++ module implementing digitize
t1 = time.time()
ind2 = analysis.digitize(scale, y)
t2 = time.time()
print('Time passed is %2.2f seconds' % (t2 - t1))
np.all(ind == ind2)  # ok

The result is:

Time passed is 0.02 seconds

There is a bit of cheating here, since my version of digitize assumes that the inputs are all monotonic; this might explain why it is even faster than Matlab. However, sorting an array of size 1e+6 takes 0.16 seconds (with numpy.sort), which makes my function slower than the Matlab function histc (by a factor of approx. 1.6) once sorting is included.
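For readers who want to see what a digitize that relies on sorted inputs could look like, here is a minimal sketch in Python. It is not the C++ module from the question (whose source is not shown); the name digitize_sorted is hypothetical, and the sketch assumes both arrays are sorted in ascending order. A single forward pass over both arrays is enough, so the cost is linear in len(x) + len(bins).

```python
import numpy as np

def digitize_sorted(x, bins):
    """Hypothetical merge-style digitize; assumes x and bins are sorted ascending."""
    out = np.empty(len(x), dtype=np.intp)
    n_bins = len(bins)
    j = 0  # number of bin edges that are <= the current value
    for i, value in enumerate(x):
        # Because x is sorted, j never has to move backwards.
        while j < n_bins and bins[j] <= value:
            j += 1
        out[i] = j
    return out

# Small-scale check against numpy's own result
x = np.arange(1, 101, dtype=float)
bins = np.arange(1, 101, 10, dtype=float)
same_as_numpy = np.array_equal(digitize_sorted(x, bins), np.digitize(x, bins))
```

The pure-Python loop is only meant to show the idea; any speed advantage would come from running the same loop in compiled code.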

So the questions are:

Why is numpy.digitize so slow? Is this function not supposed to be written in compiled and optimized code?
Why is my own version of digitize much faster than numpy.digitize, but still slower than Matlab (I am quite confident I use the fastest algorithm possible, given that I assume inputs are already sorted)?

I am using Fedora 16 and I recently installed the ATLAS and LAPACK libraries (but there has been no change in performance). Should I perhaps rebuild numpy? I am not sure whether my installation of numpy uses the appropriate libraries to gain maximum speed; perhaps Matlab is using better libraries.

Update

Based on the answers so far, I would like to stress that the Matlab function histc is not equivalent to numpy.histogram for someone who (like me in this case) does not care about the histogram. I need the second output of histc, which is a mapping from input values to the index of the provided input bins. Such an output is provided by the numpy functions digitize and searchsorted. As one of the answers says, searchsorted is much faster than digitize. However, searchsorted is still slower than Matlab by a factor of 2:

t1 = time.time()
ind3 = np.searchsorted(y, scale, "right")
t2 = time.time()
print('Time passed is %2.2f seconds' % (t2 - t1))

np.all(ind == ind3)  # ok

The result is:

Time passed is 0.21 seconds

So the questions are now:

What is the point of having numpy.digitize if there is an equivalent function, numpy.searchsorted, which is about 270 times faster?
Why is the Matlab function histc (which also provides the output of numpy.searchsorted) 2 times faster than numpy.searchsorted?

Answer 1

First, let's look at why numpy.digitize is slow. If your bins are found to be monotonic, then one of these functions is called depending on whether the bins are nondecreasing or nonincreasing (the code for this is found in numpy/lib/src/_compiled_base.c in the numpy git repo):

static npy_intp
incr_slot_(double x, double *bins, npy_intp lbins)
{
    npy_intp i;

    for ( i = 0; i < lbins; i ++ ) {
        if ( x < bins [i] ) {
            return i;
        }
    }
    return lbins;
}

static npy_intp
decr_slot_(double x, double * bins, npy_intp lbins)
{
    npy_intp i;

    for ( i = lbins - 1; i >= 0; i -- ) {
        if (x < bins [i]) {
            return i + 1;
        }
    }
    return 0;
}

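As you can see, it is doing a linear search. Each lookup costs O(n) comparisons in the number of bins, versus O(log n) for a binary search; with the 100,000 bins in your example, that is tens of thousands of comparisons per value on average instead of about 17. So there is your answer as to why it is slow. I will open a ticket for this on the numpy tracker.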
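To make the difference concrete, here is a small Python sketch. The function incr_slot_linear is a transcription of the nondecreasing C loop above, and incr_slot_binary is an illustrative replacement built on the standard library's bisect_right; both names are made up for this example and are not part of numpy.

```python
from bisect import bisect_right

def incr_slot_linear(x, bins):
    # Same logic as the C loop: return the first index i with x < bins[i].
    for i, edge in enumerate(bins):
        if x < edge:
            return i
    return len(bins)

def incr_slot_binary(x, bins):
    # bisect_right returns the first index whose element is strictly
    # greater than x, which is the same slot, found in O(log n) steps.
    return bisect_right(bins, x)

bins = [1.0, 11.0, 21.0, 31.0]
probes = (0.0, 1.0, 10.9, 11.0, 25.0, 31.0, 99.0)
agree = all(incr_slot_linear(x, bins) == incr_slot_binary(x, bins) for x in probes)
```

Both functions return the same slot for every probe, including values that sit exactly on a bin edge; only the number of comparisons differs.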

Second, I think that Matlab is actually slower than your C++ code because Matlab also assumes that the bins are monotonically nondecreasing.

Answer 2

I can't answer why numpy.digitize() is so slow, but I can confirm your timings on my machine.

The function numpy.searchsorted() does basically the same thing as numpy.digitize(), but efficiently.

ind = np.searchsorted(y, scale, "right")

takes about 0.15 seconds on my machine and gives exactly the same result as your code.

Note that your Matlab code does something different from both of those functions -- it is the equivalent of numpy.histogram().
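If both the per-value bin index and the per-bin counts are wanted, one possible way to get them from a single searchsorted call is sketched below. The small arrays are made up for illustration, and the sketch does not try to reproduce how histc treats values outside the range of the edges.

```python
import numpy as np

edges = np.array([1.0, 11.0, 21.0])
x = np.array([1.0, 5.0, 11.0, 20.0, 21.0])

# 1-based bin index for each value; 0 would mean "below the first edge".
bin_idx = np.searchsorted(edges, x, side="right")

# Counts derived from the indices. Slot 0 collects values below edges[0],
# so it is dropped to leave one count per edge.
counts = np.bincount(bin_idx, minlength=len(edges) + 1)[1:]

# On this data the indices agree with np.digitize.
matches_digitize = np.array_equal(bin_idx, np.digitize(x, edges))
```

Here bin_idx is [1, 1, 2, 2, 3] and counts is [2, 2, 1].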

Answer 3

Before the question can get answered, several subquestions need to be addressed:

To get more reliable results, run several iterations of each test and average the results. This helps eliminate startup effects, which have nothing to do with the algorithm. Also, try larger data sets for the same purpose.
Use the same algorithms across the frameworks. This has already been addressed in other answers here.
Make sure the algorithms are really similar enough. How do they utilize system resources? How do they iterate over memory? If (just as an example) a Matlab algorithm uses repmat and the numpy one does not, the comparison is not fair.
How does each framework parallelize? This may depend on your individual machine and processor configuration. Matlab does parallelize some (but by far not all) built-in functions. I don't know about numpy/CPython.
Use a memory profiler to find out how both implementations behave from that point of view.
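As a rough illustration of the first point, repeated timing can be done with the standard timeit module. The helper name best_time and the reduced array size are choices made for this sketch, not something taken from the question.

```python
import timeit
import numpy as np

# Smaller arrays than in the question so the sketch runs quickly.
scale = np.arange(1, 1e5 + 1)
y = np.arange(1, 1e5 + 1, 10)

def best_time(func, repeat=5, number=3):
    # Run `number` calls per measurement and take `repeat` measurements.
    runs = timeit.repeat(func, repeat=repeat, number=number)
    # The minimum is the measurement least disturbed by other processes.
    return min(runs) / number

t_search = best_time(lambda: np.searchsorted(y, scale, "right"))
t_hist = best_time(lambda: np.histogram(scale, y))
print("searchsorted: %.6f s, histogram: %.6f s" % (t_search, t_hist))
```

The sketch reports the minimum rather than the average, which is a common convention for micro-benchmarks; replacing min(runs) with sum(runs) / len(runs) gives the average instead.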

Afterwards (this is only a guess) we will probably find that numpy often does behave slower than Matlab. Many questions here on SO come to the same conclusion. One explanation could be that Matlab has an easier job optimizing array access, because it does not need to take a whole collection of general-purpose objects into account (as CPython does). The requirements on mathematical arrays are much lower than those on general arrays. numpy, on the other hand, relies on CPython, which must serve the full Python library, not only numpy. However, according to this comparison test (among many others), Matlab is still pretty slow ...

Answer 4

I don't think you are comparing the same functions in numpy and matlab. The equivalent to histc is np.histogram as far as I can tell from looking at the documentation. I don't have matlab to do a comparison, but when I do the following on my machine:

In [7]: import numpy as np

In [8]: scale=np.arange(1,1e+6+1)

In [9]: y=np.arange(1,1e+6+1,10)

In [10]: %timeit np.histogram(scale,y)
10 loops, best of 3: 135 ms per loop

I get a number that is approximately equivalent to what you get for histc.

4 answers

solution1
19 ACCEPTED 2012-02-25 15:49:12

solution2
5 2012-02-25 14:02:09

solution3
2 2012-02-25 16:22:14

solution4
1 2012-02-25 13:59:17
