
Python Pandas groupby limited cumulative sum

This is my DataFrame:

import pandas as pd
import numpy as np

data = {'c1':[-1,-1,1,1,np.nan,1,1,1,1,1,np.nan,-1],\
        'c2':[1,1,1,-1,1,1,-1,-1,1,-1,1,np.nan]}

index = pd.date_range('2000-01-01','2000-03-20', freq='W')

df = pd.DataFrame(index=index, data=data)


>>> df
             c1   c2
2000-01-02 -1.0  1.0
2000-01-09 -1.0  1.0
2000-01-16  1.0  1.0
2000-01-23  1.0 -1.0
2000-01-30  NaN  1.0
2000-02-06  1.0  1.0
2000-02-13  1.0 -1.0
2000-02-20  1.0 -1.0
2000-02-27  1.0  1.0
2000-03-05  1.0 -1.0
2000-03-12  NaN  1.0
2000-03-19 -1.0  NaN

This is the cumulative sum computed per month:

df2 = df.groupby(df.index.to_period('M')).cumsum()

>>> df2
             c1   c2
2000-01-02 -1.0  1.0
2000-01-09 -2.0  2.0
2000-01-16 -1.0  3.0
2000-01-23  0.0  2.0
2000-01-30  NaN  3.0
2000-02-06  1.0  1.0
2000-02-13  2.0  0.0
2000-02-20  3.0 -1.0
2000-02-27  4.0  0.0
2000-03-05  1.0 -1.0
2000-03-12  NaN  0.0
2000-03-19  0.0  NaN

What I actually need is to ignore an increment whenever it would push the running sum above 3 or below 0, as in this function:

def cumsum2(arr, low=-float('Inf'), high=float('Inf')):
    arr2 = np.copy(arr)
    sm = 0
    for index, elem in np.ndenumerate(arr):
        if not np.isnan(elem):
            sm += elem
            if sm > high:
                sm = high
            if sm < low:
                sm = low
        arr2[index] = sm
    return arr2

The expected result is:

             c1   c2
2000-01-02  0.0  1.0
2000-01-09  0.0  2.0
2000-01-16  1.0  3.0
2000-01-23  2.0  2.0
2000-01-30  2.0  3.0
2000-02-06  1.0  1.0
2000-02-13  2.0  0.0
2000-02-20  3.0  0.0
2000-02-27  3.0  1.0
2000-03-05  1.0  0.0
2000-03-12  1.0  1.0
2000-03-19  0.0  1.0

I tried using apply with a lambda, but it does not work, and it is slow on large DataFrames.

df.groupby(df.index.to_period('M')).apply(lambda x: cumsum2(x, 0, 3))

What is wrong? Is there a faster way?
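A likely reason the apply call misbehaves: apply hands each monthly group to the lambda as a two-column DataFrame, and np.ndenumerate walks a 2-D array row by row, so a single running sum carries over from c1 into c2. The sketch below reproduces this on a small made-up array (an assumption for illustration, not data from the question) and shows one possible per-column workaround with np.apply_along_axis.

```python
import numpy as np

def cumsum2(arr, low=-float('Inf'), high=float('Inf')):
    # Same function as in the question, repeated so the sketch runs on its own.
    arr2 = np.copy(arr)
    sm = 0
    for index, elem in np.ndenumerate(arr):
        if not np.isnan(elem):
            sm += elem
            if sm > high:
                sm = high
            if sm < low:
                sm = low
        arr2[index] = sm
    return arr2

# Made-up 2-D input standing in for one monthly group with two columns.
group = np.array([[1.0, 1.0],
                  [1.0, 1.0]])

# Row-major traversal: one running sum is shared by both columns.
mixed = cumsum2(group, 0, 3)

# Running the same function down each column keeps the sums separate.
per_column = np.apply_along_axis(cumsum2, 0, group, 0, 3)

print(mixed)       # rows [1. 2.] and [3. 3.]
print(per_column)  # rows [1. 1.] and [2. 2.]
```

In the mixed result the second column starts from the first column's total, which is not what a per-column limited sum should do.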

You can try accumulate from itertools together with a custom function that clips the running value to the range 0 to 3:

from itertools import accumulate

lb = 0  # lower bound
ub = 3  # upper bound

def cumsum2(dfm):
    def clip(bal, val):
        return np.clip(bal + val, lb, ub)
    return list(accumulate(dfm.to_numpy(), clip, initial=0))[1:]

out = df.fillna(0).groupby(df.index.to_period('M')).transform(cumsum2)

Output:

>>> out
             c1   c2
2000-01-02  0.0  1.0
2000-01-09  0.0  2.0
2000-01-16  1.0  3.0
2000-01-23  2.0  2.0
2000-01-30  2.0  3.0
2000-02-06  1.0  1.0
2000-02-13  2.0  0.0
2000-02-20  3.0  0.0
2000-02-27  3.0  1.0
2000-03-05  1.0  0.0
2000-03-12  1.0  1.0
2000-03-19  0.0  1.0
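To see how the accumulate step behaves in isolation, here is a minimal sketch on plain Python lists, using two slices copied from the example data (January of c1 with NaN replaced by 0, and February of c2). The helper name clipped_step is hypothetical, and min/max stand in for np.clip so the snippet needs only the standard library.

```python
from itertools import accumulate

def clipped_step(balance, value, low=0, high=3):
    # Add the new value, then clamp the running balance to [low, high].
    return min(high, max(low, balance + value))

january_c1 = [-1, -1, 1, 1, 0]   # c1 for January, NaN already replaced by 0
february_c2 = [1, -1, -1, 1]     # c2 for February

# initial=0 seeds the balance; [1:] drops that seed from the output.
jan = list(accumulate(january_c1, clipped_step, initial=0))[1:]
feb = list(accumulate(february_c2, clipped_step, initial=0))[1:]

print(jan)  # [0, 0, 1, 2, 2]
print(feb)  # [1, 0, 0, 1]
```

Both lists match the corresponding rows of the output above. The initial keyword of accumulate requires Python 3.8 or later.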

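In a complex case like this, we can resort to pandas.Series.rolling with a window of size 2, piping each window to a custom function that keeps every intermediate accumulation within the given thresholds: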

def cumsum_tsh(x, low=-float('Inf'), high=float('Inf')):
    def f(w):
        w[-1] = min(high, max(low, w[0] if w.size == 1 else w[0] + w[1]))
        return w[-1]
    return x.apply(lambda s: s.rolling(2, min_periods=1).apply(f))

res = df.fillna(0).groupby(df.index.to_period('M'), group_keys=False)\
    .apply(lambda x: cumsum_tsh(x, 0, 3))

             c1   c2
2000-01-02  0.0  1.0
2000-01-09  0.0  2.0
2000-01-16  1.0  3.0
2000-01-23  2.0  2.0
2000-01-30  2.0  3.0
2000-02-06  1.0  1.0
2000-02-13  2.0  0.0
2000-02-20  3.0  0.0
2000-02-27  3.0  1.0
2000-03-05  1.0  0.0
2000-03-12  1.0  1.0
2000-03-19  0.0  1.0

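I tried various solutions and, for some reason, the fastest was to work column by column on the frames created by groupby. Here is the code, in case it is useful to anyone: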

def cumsum2(frame, low=-float('Inf'), high=float('Inf')):
    for col in frame.columns:
        sm = 0
        xs = []
        for e in frame[col]:
            sm += e
            if sm > high:
                sm = high
            if sm < low:
                sm = low
            xs.append(sm)
        frame[col] = xs
    return frame

res = df.fillna(0).groupby(df.index.to_period('M'), group_keys=False)\
    .apply(cumsum2, 0, 3)
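The same loop can also be written as a single pass over each column that resets the running sum whenever the month changes, which sidesteps groupby and apply entirely. This is only a sketch under assumptions: clipped_group_cumsum and month_ids are hypothetical names, the month key is built from the index's year and month, and no timing has been measured.

```python
import numpy as np
import pandas as pd

def clipped_group_cumsum(values, group_ids, low, high):
    """Running sum clamped to [low, high], restarted at each new group id."""
    out = np.empty(len(values), dtype=float)
    total = 0.0
    for i, v in enumerate(values):
        # Restart the sum on the first row and whenever the group changes.
        if i == 0 or group_ids[i] != group_ids[i - 1]:
            total = 0.0
        # NaN contributes nothing; the previous total is carried forward.
        if not np.isnan(v):
            total = min(high, max(low, total + v))
        out[i] = total
    return out

data = {'c1': [-1, -1, 1, 1, np.nan, 1, 1, 1, 1, 1, np.nan, -1],
        'c2': [1, 1, 1, -1, 1, 1, -1, -1, 1, -1, 1, np.nan]}
index = pd.date_range('2000-01-01', '2000-03-20', freq='W')
df = pd.DataFrame(index=index, data=data)

# One integer per calendar month; rows are assumed to be sorted by date.
month_ids = (df.index.year * 12 + df.index.month).to_numpy()

res = pd.DataFrame(
    {col: clipped_group_cumsum(df[col].to_numpy(), month_ids, 0, 3)
     for col in df.columns},
    index=df.index,
)
print(res)
```

On the example data this reproduces the expected result from the question. The function skips NaN itself, so fillna(0) is not needed here.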

                   

