
How to make numba(nopython=true) work with 1D numpy.ndarray input with unknown number of elements

I'm porting a (mathematically complicated/involved but few operations) homebrew empirical distribution class from C++/MATLAB (I have both) to Python.

The file has some 1100 lines of code including comments and test data including a

if __name__ == "__main__": 

at the bottom of the file.

line 83 has the function declaration: def cdf(self, x):

It compiled and ran fine; it's just very slow, so I want to compile it with @numba.jit(nopython=True) to make it run faster.

However, the compilation died on one of the earliest lines of the function (only comments come before it), line 85 of the file: npts = len(x).

The message ends with:

[1] During: typing of argument at
C:\Users\kdalbey\Canopy\scripts\empDist.py (85)
--%<-----------------------------------------------------------------

File "Canopy\scripts\empDist.py", line 85

This error may have been caused by the following argument(s):
- argument 0: cannot determine Numba type of <class '__main__.empDist'>

Now, I really did import numpy as np at the top of the file, but for clarity in this post I've tried to replace np with numpy below. I might have missed a few.

If I use npts = x.size, I get the same error message.

So I try to type x as:

@numba.jit(nopython=True)
def cdf(self, x: numpy.ndarray(dtype=numpy.float64)):

And I get the following error:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
C:\Users\kdalbey\Canopy\scripts\empDist.py in <module>()
     15 np.set_printoptions(precision=16)
     16 
---> 17 class empDist:
     18     def __init__(self, xdata):
     19         npts=len(xdata)
C:\Users\kdalbey\Canopy\scripts\empDist.py in empDist()
     81 
     82     @numba.jit(nopython=True)
---> 83     def cdf(self, x: np.ndarray(dtype=np.float64)):
     84         # compute the value of cdf at vector of points x
     85         npts = x.size
TypeError: Required argument 'shape' (pos 1) not found

But I don't know how many elements the 1D numpy.ndarray has in advance (it's arbitrary).

I guessed that I might be able to do a

@numba.jit(nopython=True)
def cdf(self, x: numpy.ndarray(shape=(), dtype=numpy.float64)):

and it gets past that error only to go back to the

[1] During: typing of argument at
C:\Users\kdalbey\Canopy\scripts\empDist.py (85)
--%<-----------------------------------------------------------------
File "Canopy\scripts\empDist.py", line 85
This error may have been caused by the following argument(s):
- argument 0: cannot determine Numba type of <class '__main__.empDist'>

And it's the same error if I do either npts = int(x.size) or npts = numpy.int32(x.size), so I'm figuring the problem is with x.

Your approach is problematic because of multiple issues (as of numba version 0.46.0):

  • The numpy.ndarray(shape=(), dtype=numpy.float64) really tries to create a NumPy array. It doesn't matter that you used it as a type hint; it is still executed (and fails).
  • Instead of type hints you should use the (for numba) more appropriate signature in the jit decorator. Or even better: omit the signature entirely and just let numba figure it out. In most cases numba is better at it, and it takes much less effort (if you don't need to restrict the types).
  • You cannot jit a method in nopython mode. A better approach is to make a separate function and call it from your method.

So in your case:

import numba as nb

@nb.njit
def _cdf(x):
    # do something with x and return the result
    ...

class empDist:
    def cdf(self, x):
        result = _cdf(x)
        ...

Your example might be more complicated, but that should give you a good place to start. If you need to use instance attributes, simply pass them along to _cdf (provided numba supports their types).


In general it's not really a good idea to try to use numba on everything. Numba has a very limited scope but where it's applicable it can be amazing.

In your case you said that it's slow. The first step should be to profile your code and find out why and where it's slow. Then see whether you can replace the bottleneck with a faster approach. Often the problem isn't the code itself but the algorithm: check whether it uses a sub-optimal approach. If it doesn't, and it's a numerically heavy part, then it might make sense to use numba. But be warned: often you don't really need numba at all, because you can get sufficient performance just by optimizing the NumPy parts.

Ok... the problem was that it was a method (member function); I got that from MrFuppes. Isolating it in its own function, which the method then calls, worked great (with almost no modifications to the function that had worked pre-numba).

BTW, I will try to get approval to release/publish the empirical distribution code, but that will be a ways off. I also might want to learn Cython and recode it for speed there; the numba compilation takes O(seconds) on my machine because the operations are mathematically complicated/involved, but there aren't a lot of them from a flop-count perspective.

Selling points compared to sklearn.neighbors.kde: my empirical distribution is significantly faster (after/discounting the @numba.jit(nopython=True) compilation caching). Running in Canopy (with numba 0.36.2, so np.interp didn't benefit from numba) on Windows, building this empirical distribution took 5.72e-5 seconds, compared to 2.03e-4 seconds to fit the sklearn KDE, for 463 points. Moreover, it should scale quite well to very large numbers of points: apart from a quicksort, which is O(n log(n)), and interpolation, which is O(n), the construction cost (and the memory needed to store the object) is O(n^(1/3)) (with a significant constant factor).

It has "simple" analytical formulas for the PDF, CDF and inverse CDF, so the empirical distribution is a lot faster to evaluate too. It has comparable or slightly better accuracy than the sklearn KDE for the Gaussian (using bandwidth = (maxx-minx)*0.015, which I copied from someone else's code, someone presumably better with the sklearn KDE than I am; obviously the accuracy of a KDE depends significantly on the bandwidth, whereas my empirical distribution takes no parameters other than the data during construction and algorithmically figures out everything it needs to know about the data), and substantially better accuracy for distributions with finite tails (e.g. uniform or exponential). The improved accuracy comes in part from it being smoother/less oscillatory than the sklearn KDE.
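The cheap O(n) evaluation mentioned above is the kind of thing np.interp gives you once a CDF is known at a small set of knots; a hypothetical illustration (the knot values below are made up, not the original distribution):

```python
import numpy as np

# Hypothetical knots of a piecewise-linear CDF (illustrative values only)
knots = np.array([0.0, 1.0, 2.0, 3.0])
cdf_at_knots = np.array([0.0, 0.25, 0.75, 1.0])

# Evaluate the CDF at arbitrary query points by linear interpolation
x = np.array([0.5, 2.5])
print(np.interp(x, knots, cdf_at_knots))  # [0.125 0.875]
```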
