Increase performance of df.rolling(...).apply(...) for large dataframes

Execution time of this code is too long.


My dataframe's shape is (500, 10000).

                   0         1 ... 9999
2021-11-01  0.011111  0.054242 
2021-11-04  0.025244  0.003653 
2021-11-05  0.524521  0.099521 
2021-11-06  0.054241  0.138321 

For each date, the calculation uses the values of the last 255 dates. myFunc looks like:

def myFunc(x):
   coefs = ...
   return np.sqrt(np.sum(x ** 2 * coefs))

I tried swifter, but the performance is the same:

import swifter

I also tried Dask, but I think I didn't understand it well, because the performance is not much better:

import dask.dataframe as dd
ddf = dd.from_pandas(df)
ddf = ddf.rolling(window=255).apply(myFunc, raw=False)

I didn't manage to parallelize the execution with partitions. How can I use dask to improve performance? I'm on Windows.

This can be done using numpy + numba pretty efficiently.

Quick MRE:

import numpy as np, pandas as pd, numba

df = pd.DataFrame(
    np.random.random(size=(500, 10000)),
    index=pd.date_range("2021-11-01", freq="D", periods=500),
)

coefs = np.random.random(size=255)

Write the function using pure numpy operations and simple loops, making use of numba.njit(parallel=True) and numba.prange:

@numba.njit(parallel=True)
def numba_func(values, coefficients):
    # define result array: size of original, minus length of
    # coefficients, + 1
    result_tmp = np.zeros(
        shape=(values.shape[0] - len(coefficients) + 1, values.shape[1]),
    )

    result_final = np.empty_like(result_tmp)

    # nested for loops are your friend with numba!
    # (you must unlearn what you have learned)
    for j in numba.prange(values.shape[1]):
        for i in range(values.shape[0] - len(coefficients) + 1):
            for k in range(len(coefficients)):
                result_tmp[i, j] += values[i + k, j] ** 2 * coefficients[k]

        result_final[:, j] = np.sqrt(result_tmp[:, j])

    return result_final

This runs very quickly:

In [5]: %%time
   ...: result = pd.DataFrame(
   ...:     numba_func(df.values, coefs),
   ...:     index=df.index[len(coefs) - 1:],
   ...: )
CPU times: user 1.69 s, sys: 40.9 ms, total: 1.73 s
Wall time: 844 ms
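
As a cross-check on what the kernel above computes, here is a minimal NumPy-only sketch of the same weighted root-sum-of-squares. It swaps the explicit loops for np.lib.stride_tricks.sliding_window_view (assumes NumPy 1.20 or newer) and is meant for verifying results on small inputs, not as a faster replacement. The name rolling_weighted_norm, the 12 x 3 array and the window of 5 are illustrative assumptions, not part of the original code.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def rolling_weighted_norm(values, coefficients):
    """sqrt(sum_k values[i + k, j] ** 2 * coefficients[k]) for every full window."""
    squared = values ** 2
    # Read-only view of shape (n_rows - window + 1, n_cols, window); no copy is made here.
    windows = sliding_window_view(squared, len(coefficients), axis=0)
    # Weighted sum over the last (window) axis, then the square root.
    return np.sqrt(windows @ coefficients)


rng = np.random.default_rng(0)
values = rng.random((12, 3))    # small illustrative data, not the 500 x 10000 frame
coefficients = rng.random(5)    # illustrative window of 5 instead of 255

result = rolling_weighted_norm(values, coefficients)
```

Row i of result corresponds to the window ending at input row i + len(coefficients) - 1, which is the same alignment as index=df.index[len(coefs) - 1:] above.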

Note: I'm a huge fan of dask. But the first rule of dask performance is "don't use dask". If it's small enough to fit comfortably into memory, you'll usually get the best performance from tuning your pandas or numpy operations and leveraging speedups from cython, numba, etc. And once a problem is big enough to move to dask, these same tuning rules apply to the operations you perform on dask chunks/partitions, too!

First, since you are using numpy functions, specify the parameter raw=True, so that each window is passed to your function as a numpy array rather than as a Series. Toy example:

import pandas as pd
import numpy as np

def foo(x):
    coefs = 2
    return np.sqrt(np.sum(x ** 2 * coefs))    

df = pd.DataFrame(np.random.random((500, 10000)))

res = df.rolling(250).apply(foo)

Wall time: 359.3 s

# with raw=True
res = df.rolling(250).apply(foo, raw=True)

Wall time: 15.2 s
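
The difference comes from what pandas hands to the function for each window. A minimal sketch that records the argument type under both settings; record_type and the tiny 6 x 2 frame are illustrative names and data, not part of the timing above:

```python
import numpy as np
import pandas as pd

seen = []


def record_type(x):
    # Remember what kind of object pandas passed in for this window.
    seen.append(type(x).__name__)
    return 0.0


df_small = pd.DataFrame(np.arange(12, dtype=float).reshape(6, 2))

df_small.rolling(3).apply(record_type, raw=False)  # raw=False is the default
types_default = set(seen)

seen.clear()
df_small.rolling(3).apply(record_type, raw=True)
types_raw = set(seen)

print(types_default, types_raw)
```

With raw=False every window is wrapped in a Series before the call, and that per-window construction is the overhead that raw=True avoids.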

You can also easily parallelize your calculations using the parallel-pandas library. Only two additional lines of code!

# pip install parallel-pandas
import pandas as pd
import numpy as np
from parallel_pandas import ParallelPandas

# initialize parallel-pandas
ParallelPandas.initialize(n_cpu=8, disable_pr_bar=True)

def foo(x):
    coefs = 2
    return np.sqrt(np.sum(x ** 2 * coefs))    

df = pd.DataFrame(np.random.random((500, 1000)))

# p_apply is the parallel analogue of the apply method
res = df.rolling(250).p_apply(foo, raw=True, executor='processes')

Wall time: 2.2 s

With engine='numba':

res = df.rolling(250).p_apply(foo, raw=True, executor='processes', engine='numba')

Wall time: 1.2 s

The total speedup is 359 / 1.2 ≈ 300!

