
Speeding up rolling sum calculation in pandas groupby

I want to compute rolling sums group-wise for a large number of groups and I'm having trouble doing it acceptably quickly.

Pandas has built-in methods for rolling and expanding calculations.

Here's an example:

import pandas as pd
import numpy as np
obs_per_g = 20
g = 10000
obs = g * obs_per_g
k = 20
df = pd.DataFrame(
    data=np.random.normal(size=obs * k).reshape(obs, k),
    index=pd.MultiIndex.from_product(iterables=[range(g), range(obs_per_g)]),
)

To get rolling and expanding sums I can use:

df.groupby(level=0).expanding().sum()
df.groupby(level=0).rolling(window=5).sum()

But this takes a long time for a very large number of groups. For expanding sums, using the pandas method cumsum instead is almost 60 times quicker (280 ms vs 16 s for the above example) and turns hours into minutes.

df.groupby(level=0).cumsum()

Is there a fast implementation of rolling sum in pandas, like cumsum is for expanding sums? If not, could I use numpy to accomplish this?

I have had the same experience with .rolling(): it's neat, but only for small datasets or when the function you are applying is non-standard. For sum() I would suggest using cumsum() and subtracting cumsum().shift(5):

df.groupby(level=0).cumsum() - df.groupby(level=0).cumsum().shift(5)
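One caveat: the plain .shift(5) above runs over the whole frame, so values leak across group boundaries (a point discussed in more detail in another answer below). A minimal sketch that keeps the shift inside each group, using a small hypothetical frame shaped like the question's example:

```python
import numpy as np
import pandas as pd

# small stand-in for the question's frame: 2 groups of 20 rows, 2 columns
df = pd.DataFrame(
    np.random.normal(size=(40, 2)),
    index=pd.MultiIndex.from_product([range(2), range(20)]),
)

cs = df.groupby(level=0).cumsum()
# shift within each group so the first rows of a group are not
# subtracted against the previous group's cumulative sums
rolling5 = cs - cs.groupby(level=0).shift(5).fillna(0)
```

Because the unfilled leading rows fall back to the plain cumulative sum, this matches rolling(5, min_periods=1) rather than the default min_periods behaviour.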

To provide the latest information on this: if you upgrade pandas, the performance of groupby rolling has been significantly improved. Compared to 0.24 or 1.0.0, it is approximately 4-5 times faster in 1.1.0 and about 12 times faster in >=1.2.0.

I believe the biggest performance improvement comes from this PR, which means more of the work can be done in Cython (before that, it was implemented roughly like groupby.apply(lambda x: x.rolling())).

I used the code below to benchmark:

import pandas
import numpy

print(pandas.__version__)
print(numpy.__version__)


def stack_overflow_df():
    obs_per_g = 20
    g = 10000
    obs = g * obs_per_g
    k = 2
    df = pandas.DataFrame(
        data=numpy.random.normal(size=obs * k).reshape(obs, k),
        index=pandas.MultiIndex.from_product(iterables=[range(g), range(obs_per_g)]),
    )
    return df


df = stack_overflow_df()

# N.B. droplevel important to make indices match
rolling_result = (
    df.groupby(level=0)[[0, 1]].rolling(10, min_periods=1).sum().droplevel(level=0)
)
df[["value_0_rolling_sum", "value_1_rolling_sum"]] = rolling_result
# (the block above was timed with the %%timeit cell magic in Jupyter)
# results:
# numpy version always 1.19.4
# pandas 0.24 = 12.3 seconds
# pandas 1.0.5 = 12.9 seconds
# pandas 1.1.0 = broken with groupby rolling bug
# pandas 1.1.1 = 2.9 seconds
# pandas 1.1.5 = 2.5 seconds
# pandas 1.2.0 = 1.06 seconds
# pandas 1.2.2 = 1.06 seconds

I think care must be taken when trying to use numpy.cumsum to improve performance (regardless of pandas version). For example, using something like the below:

# Gives different output
df.groupby(level=0)[[0, 1]].cumsum() - df.groupby(level=0)[[0, 1]].cumsum().shift(10)

While this is much faster, the output is not correct. The shift is performed over all rows and mixes the cumsums of different groups, i.e. the first results of each group are shifted back into the previous group.

To get the same behaviour as above, you need to use apply:

df.groupby(level=0)[[0, 1]].cumsum() - df.groupby(level=0)[[0, 1]].apply(
    lambda x: x.cumsum().shift(10).fillna(0)
)

which, in the most recent version (1.2.2), is slower than using rolling directly. Hence, for groupby rolling sums, I don't think numpy.cumsum is the best solution for pandas >= 1.1.1.
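That said, if you are on an older pandas or working in a pure-numpy pipeline, the cumsum-difference trick can be vectorised across groups without apply, provided every group has the same number of rows, as in the question's example. A sketch (the function name is mine, not from any library):

```python
import numpy as np

def grouped_rolling_sum(values, group_size, window):
    """Rolling sum per group via a cumsum difference (min_periods=1 semantics).

    Assumes the rows of `values` are stacked group after group and every
    group has exactly `group_size` rows, as in the question's setup.
    """
    # reshape to (n_groups, group_size, k) so each group gets its own cumsum
    a = values.reshape(-1, group_size, values.shape[1])
    cs = a.cumsum(axis=1)
    out = cs.copy()
    # subtract the cumsum from `window` rows earlier; the first `window`
    # rows keep the plain cumsum, i.e. an expanding sum
    out[:, window:] = cs[:, window:] - cs[:, :-window]
    return out.reshape(values.shape)
```

The reshape is free (no copy) when the array is C-contiguous, so the whole thing is a few vectorised passes over the data.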

For completeness, if your groups are columns rather than the index, you should use syntax like this:

# N.B. reset_index important to make indices match
rolling_result = (
    df.groupby(["category_0", "category_1"])[["value_0", "value_1"]]
    .rolling(10, min_periods=1)
    .sum()
    .reset_index(drop=True)
)
df[["value_0_rolling_sum", "value_1_rolling_sum"]] = rolling_result
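The column names above are placeholders. A tiny self-contained frame wired up the same way (hypothetical data, already sorted by the group keys so that reset_index(drop=True) lines the rows back up with the original frame) could look like:

```python
import numpy as np
import pandas as pd

# hypothetical frame matching the column names used above,
# sorted by (category_0, category_1) so row order survives the groupby
df = pd.DataFrame({
    "category_0": np.repeat(["a", "b"], 4),
    "category_1": np.tile(["x", "x", "y", "y"], 2),
    "value_0": np.arange(8.0),
    "value_1": np.arange(8.0) * 2,
})

rolling_result = (
    df.groupby(["category_0", "category_1"])[["value_0", "value_1"]]
    .rolling(10, min_periods=1)
    .sum()
    .reset_index(drop=True)
)
df[["value_0_rolling_sum", "value_1_rolling_sum"]] = rolling_result
```

Note that if the frame is not sorted by the group keys, reset_index(drop=True) silently misaligns rows; in that case keep the group levels and align on the original index instead.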
