Increase performance of df.rolling(...).apply(...) for large dataframes

The execution time of this code is too long:

df.rolling(window=255).apply(myFunc)

My dataframe's shape is (500, 10000).

                   0         1 ... 9999
2021-11-01  0.011111  0.054242 
2021-11-04  0.025244  0.003653 
2021-11-05  0.524521  0.099521 
2021-11-06  0.054241  0.138321 
...

For each date, I run the calculation over the values of the last 255 dates. myFunc looks like:

def myFunc(x):
   coefs = ...
   return np.sqrt(np.sum(x ** 2 * coefs))

I tried to use swifter, but the performance is the same:

import swifter
df.swifter.rolling(window=255).apply(myFunc)

I also tried Dask, but I think I didn't understand it well, because the performance is not much better:

import dask.dataframe as dd
ddf = dd.from_pandas(df)
ddf = ddf.rolling(window=255).apply(myFunc, raw=False)
ddf.compute()

I didn't manage to parallelize the execution with partitions. How can I use dask to improve performance? I'm on Windows.

This can be done pretty efficiently using numpy + numba.

Quick MRE:

import numpy as np, pandas as pd, numba

df = pd.DataFrame(
    np.random.random(size=(500, 10000)),
    index=pd.date_range("2021-11-01", freq="D", periods=500)
)

coefs = np.random.random(size=255)

Write the function using pure numpy operations and simple loops, making use of numba.njit(parallel=True) and numba.prange:

@numba.njit(parallel=True)
def numba_func(values, coefficients):
    # define result array: size of original, minus length of
    # coefficients, + 1
    result_tmp = np.zeros(
        shape=(values.shape[0] - len(coefficients) + 1, values.shape[1]),
        dtype=values.dtype,
    )

    result_final = np.empty_like(result_tmp)

    # nested for loops are your friend with numba!
    # (you must unlearn what you have learned)
    for j in numba.prange(values.shape[1]):
        for i in range(values.shape[0] - len(coefficients) + 1):
            for k in range(len(coefficients)):
                result_tmp[i, j] += values[i + k, j] ** 2 * coefficients[k]

        result_final[:, j] = np.sqrt(result_tmp[:, j])

    return result_final

This runs very quickly:

In [5]: %%time
   ...: result = pd.DataFrame(
   ...:     numba_func(df.values, coefs),
   ...:     index=df.index[len(coefs) - 1:],
   ...: )
   ...:
   ...:
CPU times: user 1.69 s, sys: 40.9 ms, total: 1.73 s
Wall time: 844 ms
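
As a cross-check that does not need numba, the same weighted rolling computation can be sketched with NumPy alone and compared against the original rolling/apply formulation. The name rolling_weighted_norm and the small array sizes are illustrative assumptions, not part of the answer above, and no timing is claimed for this variant:

```python
import numpy as np
import pandas as pd
from numpy.lib.stride_tricks import sliding_window_view


def rolling_weighted_norm(values, coefficients):
    """sqrt(sum_k values[i + k, j] ** 2 * coefficients[k]) for every full window."""
    squared = values ** 2
    # View of shape (n_rows - window + 1, n_cols, window); the view itself copies no data.
    windows = sliding_window_view(squared, len(coefficients), axis=0)
    # Contract the trailing window axis against the coefficients.
    return np.sqrt(np.einsum("ijk,k->ij", windows, coefficients))


# Small illustrative data so the pandas reference below stays fast.
rng = np.random.default_rng(0)
small_df = pd.DataFrame(
    rng.random((30, 4)),
    index=pd.date_range("2021-11-01", freq="D", periods=30),
)
small_coefs = rng.random(5)

fast = pd.DataFrame(
    rolling_weighted_norm(small_df.to_numpy(), small_coefs),
    index=small_df.index[len(small_coefs) - 1:],
    columns=small_df.columns,
)

# Reference: the rolling/apply formulation from the question.
reference = small_df.rolling(window=len(small_coefs)).apply(
    lambda x: np.sqrt(np.sum(x ** 2 * small_coefs)), raw=True
).dropna()

matches = np.allclose(fast.to_numpy(), reference.to_numpy())
print(matches)
```

The same comparison against numba_func on a small frame is a cheap way to confirm that the index offset (len(coefs) - 1) lines up with what rolling produces.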

Note: I'm a huge fan of dask, but the first rule of dask performance is: don't use dask. If the data is small enough to fit comfortably into memory, you'll usually get the best performance from tuning your pandas or numpy operations and leveraging speedups from cython, numba, etc. And once a problem is big enough to move to dask, these same tuning rules apply to the operations you perform on dask chunks/partitions, too!
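
To illustrate the point about chunks, here is a rough sketch of splitting the work by columns using only NumPy and the standard library. Each column is rolled independently, so column blocks need no overlap, whereas row partitions would have to overlap by window - 1 rows. The names kernel and chunked_by_columns, the chunk count, and the use of threads are all assumptions for illustration; whether threads give any speedup depends on how much of the work NumPy does outside the GIL, so measure before relying on it:

```python
from concurrent.futures import ThreadPoolExecutor

import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def kernel(block, coefficients):
    # Per-chunk operation: weighted rolling sum of squares, then sqrt.
    windows = sliding_window_view(block ** 2, len(coefficients), axis=0)
    return np.sqrt(np.einsum("ijk,k->ij", windows, coefficients))


def chunked_by_columns(values, coefficients, n_chunks=4):
    # Split along axis=1 so every block keeps all rows of its columns.
    blocks = np.array_split(values, n_chunks, axis=1)
    with ThreadPoolExecutor(max_workers=n_chunks) as pool:
        parts = list(pool.map(lambda b: kernel(b, coefficients), blocks))
    # Reassemble the column blocks in their original order.
    return np.hstack(parts)


rng = np.random.default_rng(0)
values = rng.random((40, 12))
coefficients = rng.random(5)

chunked = chunked_by_columns(values, coefficients)
whole = kernel(values, coefficients)
chunks_agree = np.allclose(chunked, whole)
print(chunks_agree)
```

The structure is the same one a dask partition-wise call would follow: a tuned function applied to each block, with the blocks chosen so that no window crosses a block boundary.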

First, since you are using numpy functions, specify the parameter raw=True. Toy example:

import pandas as pd
import numpy as np

def foo(x):
    coefs = 2
    return np.sqrt(np.sum(x ** 2 * coefs))    

df = pd.DataFrame(np.random.random((500, 10000)))

%%time
res = df.rolling(250).apply(foo)

Wall time: 359.3 s

# with raw=True
%%time
res = df.rolling(250).apply(foo, raw=True)

Wall time: 15.2 s
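
The difference comes from what foo receives for each window: with the default raw=False pandas builds a Series per window, while raw=True passes the underlying NumPy array. A minimal sketch on a deliberately small frame (the sizes and the seen_types bookkeeping are illustrative additions) that records the argument type and checks that both calls give the same result:

```python
import numpy as np
import pandas as pd

seen_types = set()


def foo(x):
    # Record what pandas hands to the function for each window.
    seen_types.add(type(x).__name__)
    coefs = 2
    return np.sqrt(np.sum(x ** 2 * coefs))


df_small = pd.DataFrame(np.random.default_rng(0).random((20, 3)))

seen_types.clear()
res_default = df_small.rolling(5).apply(foo)  # raw=False is the default
types_default = set(seen_types)

seen_types.clear()
res_raw = df_small.rolling(5).apply(foo, raw=True)
types_raw = set(seen_types)

print(types_default, types_raw)

# Leading rows of incomplete windows are NaN in both results.
same_result = np.allclose(
    res_default.to_numpy(), res_raw.to_numpy(), equal_nan=True
)
print(same_result)
```

Since foo only uses NumPy operations, it never needs the Series index, which is why switching to raw=True is safe here.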

You can also easily parallelize your calculations using the parallel-pandas library. It takes only two additional lines of code:

# pip install parallel-pandas
import pandas as pd
import numpy as np
from parallel_pandas import ParallelPandas

#initialize parallel-pandas
ParallelPandas.initialize(n_cpu=8, disable_pr_bar=True)

def foo(x):
    coefs = 2
    return np.sqrt(np.sum(x ** 2 * coefs))    

df = pd.DataFrame(np.random.random((500, 1000)))

# p_apply is the parallel analogue of the apply method
%%time
res = df.rolling(250).p_apply(foo, raw=True, executor='processes')

Wall time: 2.2 s

With engine='numba':

%%time
res = df.rolling(250).p_apply(foo, raw=True, executor='processes', engine='numba')

Wall time: 1.2 s

The total speedup is 359 / 1.2 ≈ 300x!
