
Pandas dataframe - speed in python: dataframe operations, numba, cython

I have a financial dataset with ~2 million rows. I would like to import it as a pandas DataFrame and add additional columns by applying row-wise functions that use some of the existing column values. For this purpose I would prefer not to use techniques like parallelization, Hadoop for Python, etc., so I'm faced with the following:

I am already doing this, similar to the example below, and performance is poor: it takes ~24 minutes just to get through ~20K rows. Note: this is not the actual function; it is completely made up. For the additional columns I am calculating various financial option metrics. I suspect the slow speed is primarily due to iterating over all the rows, not the functions themselves, as they are fairly simple (e.g. calculating the price of an option). I know I can speed up little things within the functions, such as using erf instead of the normal distribution, but for this purpose I want to focus on the holistic problem itself.

def func(alpha, beta, time, vol):
    px = (alpha*beta)/time * vol
    return px

# Method 1 (could also use itertuples here) - this is the one that takes ~24 minutes now
for i, row in df.iterrows():
    df.loc[i, 'px'] = func(alpha, beta, row['time'], row['vol'])

I have also tried vectorizing this but keep getting an error about 'cannot serialize float' or something like that.

My thought is to try one of the following methods, but I am not sure which one would theoretically be fastest. Are there non-linearities associated with running these, such that a test with 1,000 rows would not necessarily indicate which would be fastest across all 2 million rows? Probably a separate question, but should I focus on more efficient ways to manage the dataset rather than just on applying the functions?

# Alternative 1 (df.apply with existing function above)
df['px'] = df.apply(lambda row: func(alpha, beta, row['time'], row['vol']), axis=1)

# Alternative 2 (numba & jit)
from numba import jit

@jit
def func(alpha, beta, time, vol):
    px = (alpha*beta)/time * vol
    return px

# Alternative 3 (cython) - this would go in a .pyx file and be compiled separately
def func_cython(double alpha, double beta, double time, double vol):
    cdef double px
    px = (alpha*beta)/time * vol
    return px

In the case of Cython and numba, would I still iterate over all the rows using df.apply? Or is there a more efficient way?
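For example, with numba my rough idea would be to skip df.apply entirely and pass the whole columns in as NumPy arrays, letting the jitted loop run over them. This is only a sketch of what I have in mind (not tested, and func_arrays is just a made-up name):

# Rough sketch (untested): pass whole columns as NumPy arrays into a jitted loop
# instead of calling the function once per row via df.apply
import numpy as np
from numba import jit

@jit(nopython=True)
def func_arrays(alpha, beta, time, vol):
    px = np.empty(time.shape[0])
    for i in range(time.shape[0]):
        px[i] = (alpha * beta) / time[i] * vol[i]
    return px

# df['px'] = func_arrays(alpha, beta, df['time'].values, df['vol'].values)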

I have referenced the following and found them helpful for understanding the various options, but not for deciding what the 'best' way is to do this (though I suppose that ultimately depends on the application).

https://lectures.quantecon.org/py/need_for_speed.html

Numpy vs Cython speed

Speeding up a numpy loop in python?

Cython optimization

http://www.devx.com/opensource/improve-python-performance-with-cython.html

How about simply:

df.loc[:, 'px'] = (alpha * beta) / df.loc[:, 'time'] * df.loc[:, 'vol']

By the way, your for-loop/lambda solutions are slow because the overhead for each pandas access is large. So accessing each cell separately (via looping over each row) is much slower than accessing the whole column.
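If you want to strip out the remaining per-access pandas overhead, you can also run the same arithmetic directly on the underlying NumPy arrays, assuming the columns hold plain numeric dtypes (a minor variation on the line above):

# Same vectorized computation, operating on the raw NumPy arrays
df['px'] = (alpha * beta) / df['time'].values * df['vol'].values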
