简体   繁体   中英

Tips for optimising code in Cython

I have a relatively simple question (I think). I'm working on a piece of Cython code that computes the radius of a strain ellipse when the strain and a specific direction are given (ie the radius parallel to the given direction for a certain amount of strain). This function is called several milion times during each program run and profiling revealed that this function is the limiting factor performance-wise speaking. Here's the code:

# importing math functions from a C-library (faster than numpy)
from libc.math cimport sin, cos, acos, exp, sqrt, fabs, M_PI

cdef class funcs:

    cdef inline double get_r(self, double g, double omega):
        # amount of strain: g, angle: omega
        cdef double l1, l2, A, r, g2, gs   # defining some variables
        if g == 0: return 1   # no strain means the strain ellipse is a circle
        omega = omega*M_PI/180   # converting angle omega to radians
        g2 = g*g
        gs = g*sqrt(4 + g2)
        l1 = 0.5*(2 + g2 + gs)   # l1 and l2: eigenvalues of the Cauchy strain tensor
        l2 = 0.5*(2 + g2 - gs)
        A = acos(g/sqrt(g2 + (1 - l2)**2))   # orientation of the long axis of the ellipse
        r = 1./sqrt(sqrt(l2)*(cos(omega - A)**2) + sqrt(l1)*(sin(omega - A)**2))   # the radius parallel to omega
        return r   # return of the jedi

Running this code takes about 0.18 microseconds per call, which I think is a bit long for such a simple function. Also, math.h has a square(x) function, but I can't import it from the libc.math library, anyone knows how? Any other suggestions for further improving the performance of this little piece of code?

UPDATE 2013/09/04:

There seems to be more at play than meets the eye. When I profile one function that calls get_r 10 milion times I get different performance than calling another function. I've added an updated version of my partial code. When I use get_r_profile for profiling, I get 0.073 microsec for each call of get_r , whereas MC_criterion_profile gives me about 0.164 microsec/call of get_r , a 50% difference which seems to be related to the overhead cost of return r .

from libc.math cimport sin, cos, acos, exp, sqrt, fabs, M_PI

cdef class thesis_funcs:

    cdef inline double get_r(self, double g, double omega):
        cdef double l1, l2, A, r, g2, gs, cos_oa2, sin_oa2
        if g == 0: return 1
        omega = omega*SCALEDPI
        g2 = g*g
        gs = g*sqrt(4 + g2)
        l1 = 0.5*(2 + g2 + gs)
        l2 = l1 - gs
        A = acos(g/sqrt(g2 + square(1 - l2)))
        cos_oa2 = square(cos(omega - A))
        sin_oa2 = 1 - cos_oa2
        r = 1.0/sqrt(sqrt(l2)*cos_oa2 + sqrt(l1)*sin_oa2)
        return r

    @cython.profile(False)
    cdef inline double get_mu(self, double r, double mu0, double mu1):
        return mu0*exp(-mu1*(r - 1))

    def get_r_profile(self): # Profiling through this guy gives me 0.073 microsec/call
        cdef unsigned int i
        for i from 0 <= i < 10000000:
            self.get_r(3.0, 165)

    def MC_criterion(self, double g, double omega, double mu0, double mu1, double C = 0.0):
        cdef double r, mu, theta, res
        r = self.get_r(g, omega)
        mu = self.get_mu(r, mu0, mu1)
        theta = 45 - omega
        theta = theta*SCALEDPI
        res = fabs(g*sin(2.0*theta)) - mu*(1 + g*cos(2.0*theta)) - C
        return res

    def MC_criterion_profile(self): # Profiling through this one gives 0.164 microsec/call
        cdef double g, omega, mu0, mu1
        cdef unsigned int i
        omega = 165
        mu0 = 0.6
        mu1 = 2.0
        g = 3.0
        for i from 1 <= i < 10000000:
            self.MC_criterion(g, omega, mu0, mu1)

I think there might be a fundamental difference between get_r_profile and MC_criterion which causes extra overhead cost. Can you spot it?

According to your comment, the line computing r is the most expensive. If that's the case, then I suspect it's the trig function calls that are killing performance.

By Pythagoras, cos(x)**2 + sin(x)**2 == 1 so you can skip one of those calls by computing

cos_oa2 = cos(omega - A)**2
sin_oa2 = 1 - cos_oa2
r = 1. / sqrt(sqrt(l2) * cos_oa2 + sqrt(l1) * sin_oa2)

(Or maybe flip them: on my machine, sin seems faster than cos . Might be a NumPy glitch, though.)

The output of

cython -a

shows that the division by 0 is tested. You might want to remove this check if you're 200% sure it won't happen.

To use the C division you can add the following directive to the top of your file :

# cython: cdivision=True

I'd link the official documentation but I can't access it right now. You have some information here (p15) : https://python.g-node.org/python-summerschool-2011/_media/materials/cython/cython-slides.pdf

This answer is not related to Cython, but should mention some points which might be helpful.

  1. Defining vars before really knowing if they're needed might be not ideal. Move "cdef double l1, l2, A, r, g2, gs" after the "if g == 0" statement.
  2. I would ensure that from "omega = omega*M_PI/180" the M_PI/180 part is only calculated once. Eg some Python code:

     import timeit from math import pi def calc1( omega ): return omega * pi / 180 SCALEDPI = pi / 180 def calc2( omega ): return omega * SCALEDPI if __name__ == '__main__': took = timeit.repeat( stmt = lambda:calc1( 5 ), number = 10000 ) print "Min. time: %.4f Max. time: %.4f" % ( min( took ), max( took ) ) took = timeit.repeat( stmt = lambda:calc2( 5 ), number = 10000 ) print "Min. time: %.4f Max. time: %.4f" % ( min( took ), max( took ) ) 

    calc1: Min. time: 0.0033 Max. time: 0.0034

    calc2: Min. time: 0.0029 Max. time: 0.0029

  3. Try to optimize the calculations themself. They look rather complicated and I have a feeling that they can be simplified.

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM