Python Pandas groupby limited cumulative sum
This is my dataframe:
import pandas as pd
import numpy as np
data = {'c1': [-1, -1, 1, 1, np.nan, 1, 1, 1, 1, 1, np.nan, -1],
        'c2': [1, 1, 1, -1, 1, 1, -1, -1, 1, -1, 1, np.nan]}
index = pd.date_range('2000-01-01','2000-03-20', freq='W')
df = pd.DataFrame(index=index, data=data)
>>> df
c1 c2
2000-01-02 -1.0 1.0
2000-01-09 -1.0 1.0
2000-01-16 1.0 1.0
2000-01-23 1.0 -1.0
2000-01-30 NaN 1.0
2000-02-06 1.0 1.0
2000-02-13 1.0 -1.0
2000-02-20 1.0 -1.0
2000-02-27 1.0 1.0
2000-03-05 1.0 -1.0
2000-03-12 NaN 1.0
2000-03-19 -1.0 NaN
This is the cumulative sum by month:
df2 = df.groupby(df.index.to_period('m')).cumsum()
>>> df2
c1 c2
2000-01-02 -1.0 1.0
2000-01-09 -2.0 2.0
2000-01-16 -1.0 3.0
2000-01-23 0.0 2.0
2000-01-30 NaN 3.0
2000-02-06 1.0 1.0
2000-02-13 2.0 0.0
2000-02-20 3.0 -1.0
2000-02-27 4.0 0.0
2000-03-05 1.0 -1.0
2000-03-12 NaN 0.0
2000-03-19 0.0 NaN
What I actually need is for the running sum to ignore any increment that would push it above 3 or below 0, as in this function:
def cumsum2(arr, low=-float('Inf'), high=float('Inf')):
    arr2 = np.copy(arr)
    sm = 0
    for index, elem in np.ndenumerate(arr):
        if not np.isnan(elem):
            sm += elem
            if sm > high:
                sm = high
            if sm < low:
                sm = low
        arr2[index] = sm
    return arr2
The desired result is:
c1 c2
2000-01-02 0.0 1.0
2000-01-09 0.0 2.0
2000-01-16 1.0 3.0
2000-01-23 2.0 2.0
2000-01-30 2.0 3.0
2000-02-06 1.0 1.0
2000-02-13 2.0 0.0
2000-02-20 3.0 0.0
2000-02-27 3.0 1.0
2000-03-05 1.0 0.0
2000-03-12 1.0 1.0
2000-03-19 0.0 1.0
I tried using apply with a lambda, but it does not work, and it is slow on large dataframes.
df.groupby(df.index.to_period('m')).apply(lambda x: cumsum2(x, 0, 3))
What is wrong? Is there a faster way?
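One possible explanation, offered as an assumption rather than a confirmed diagnosis: apply hands each monthly group to the function as a whole DataFrame, and np.ndenumerate walks a 2-D input element by element in row-major order, so a single running total ends up shared between c1 and c2. The sketch below reuses cumsum2 from the question on a small 2-D array (the array block is made up for illustration) and compares that with running the function on each column separately.

```python
import numpy as np

def cumsum2(arr, low=-float('Inf'), high=float('Inf')):
    # Same function as in the question: clamped running sum that skips NaN.
    arr2 = np.copy(arr)
    sm = 0
    for index, elem in np.ndenumerate(arr):
        if not np.isnan(elem):
            sm += elem
            if sm > high:
                sm = high
            if sm < low:
                sm = low
        arr2[index] = sm
    return arr2

# Hypothetical two-row, two-column block standing in for one monthly group.
block = np.array([[1.0, 1.0],
                  [1.0, -1.0]])

# Called on the 2-D block, the total runs across both columns in row-major order.
shared = cumsum2(block, 0, 3)

# Called once per column, each column gets its own running total.
per_column = np.column_stack(
    [cumsum2(block[:, j], 0, 3) for j in range(block.shape[1])]
)

print(shared)
print(per_column)
```

If this is what happens, the function needs to be applied column by column, which is what the answers below do in different ways.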
You can try accumulate from itertools with a custom function that clips the running value between 0 and 3:
from itertools import accumulate

lb = 0  # lower bound
ub = 3  # upper bound

def cumsum2(dfm):
    def clip(bal, val):
        return np.clip(bal + val, lb, ub)
    return list(accumulate(dfm.to_numpy(), clip, initial=0))[1:]

out = df.fillna(0).groupby(df.index.to_period('m')).transform(cumsum2)
Output:
>>> out
c1 c2
2000-01-02 0.0 1.0
2000-01-09 0.0 2.0
2000-01-16 1.0 3.0
2000-01-23 2.0 2.0
2000-01-30 2.0 3.0
2000-02-06 1.0 1.0
2000-02-13 2.0 0.0
2000-02-20 3.0 0.0
2000-02-27 3.0 1.0
2000-03-05 1.0 0.0
2000-03-12 1.0 1.0
2000-03-19 0.0 1.0
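To see what accumulate is doing in isolation, here is a minimal sketch on a plain list, without pandas. The helper name clipped_add and the sample list are assumptions for illustration; the list holds the January values of c1 with NaN replaced by 0. The initial keyword requires Python 3.8 or later.

```python
from itertools import accumulate

def clipped_add(total, step, low=0, high=3):
    # Add the step to the running total, then clamp the result to [low, high].
    return min(high, max(low, total + step))

# January values of c1 from the question, with NaN already replaced by 0.
steps = [-1, -1, 1, 1, 0]

# accumulate feeds the previous result back in as `total`;
# initial=0 seeds the running total and appears as the first output item.
running = list(accumulate(steps, clipped_add, initial=0))

# Drop the seed so the result lines up with the input.
clamped = running[1:]
print(clamped)  # [0, 0, 1, 2, 2]
```

The first two steps would take the total below 0, so they are clamped back to 0, which matches the first two rows of c1 in the output above.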
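For a case this involved, we can resort to pandas.Series.rolling with a window of size 2, piping each window to a custom function that keeps each intermediate accumulation within the given thresholds: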
def cumsum_tsh(x, low=-float('Inf'), high=float('Inf')):
    def f(w):
        w[-1] = min(high, max(low, w[0] if w.size == 1 else w[0] + w[1]))
        return w[-1]
    return x.apply(lambda s: s.rolling(2, min_periods=1).apply(f))

res = df.fillna(0).groupby(df.index.to_period('m'), group_keys=False)\
        .apply(lambda x: cumsum_tsh(x, 0, 3))
c1 c2
2000-01-02 0.0 1.0
2000-01-09 0.0 2.0
2000-01-16 1.0 3.0
2000-01-23 2.0 2.0
2000-01-30 2.0 3.0
2000-02-06 1.0 1.0
2000-02-13 2.0 0.0
2000-02-20 3.0 0.0
2000-02-27 3.0 1.0
2000-03-05 1.0 0.0
2000-03-12 1.0 1.0
2000-03-19 0.0 1.0
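As a small illustration of what the callback receives, the sketch below only shows the windows that rolling(2, min_periods=1) produces: the first call sees one value and every later call sees two. It does not reproduce the write-back into the window that the answer above relies on. The names seen and window_sum and the sample series are hypothetical, and raw=True is used here so the callback gets a NumPy array.

```python
import pandas as pd

seen = []  # windows handed to the callback, recorded for inspection

def window_sum(w):
    # With raw=True, w is a NumPy array holding the current window.
    seen.append(w.tolist())
    return w.sum()

s = pd.Series([1.0, 2.0, 3.0])

# min_periods=1 lets the first window contain a single element.
result = s.rolling(2, min_periods=1).apply(window_sum, raw=True)

print(seen)
print(result.tolist())  # [1.0, 3.0, 5.0]
```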
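I tried the various solutions and, for some reason, the fastest was to work on the individual columns of the frames that groupby creates. Here is the code, in case it is useful to anyone: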
def cumsum2(frame, low=-float('Inf'), high=float('Inf')):
    for col in frame.columns:
        sm = 0
        xs = []
        for e in frame[col]:
            sm += e
            if sm > high:
                sm = high
            if sm < low:
                sm = low
            xs.append(sm)
        frame[col] = xs
    return frame

res = df.fillna(0).groupby(df.index.to_period('m'), group_keys=False)\
        .apply(cumsum2, 0, 3)
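A different way to write the same column-wise loop, not taken from any of the answers, is a single pass per column that resets the running total whenever the month changes, which avoids groupby altogether. This is only a sketch: the function name clamped_cumsum_by_group and the year * 12 + month group id are assumptions, the rows are assumed to be sorted by date so that each month is contiguous, and it has not been benchmarked against the versions above.

```python
import numpy as np
import pandas as pd

def clamped_cumsum_by_group(values, group_ids, low, high):
    # Running sum clamped to [low, high]; restarts at 0 when the group id changes.
    out = np.empty(len(values), dtype=float)
    total = 0.0
    for i, v in enumerate(values):
        if i == 0 or group_ids[i] != group_ids[i - 1]:
            total = 0.0  # a new month starts here
        if not np.isnan(v):
            total = min(high, max(low, total + v))
        out[i] = total  # NaN inputs carry the previous total forward
    return out

# Data from the question.
data = {'c1': [-1, -1, 1, 1, np.nan, 1, 1, 1, 1, 1, np.nan, -1],
        'c2': [1, 1, 1, -1, 1, 1, -1, -1, 1, -1, 1, np.nan]}
index = pd.date_range('2000-01-01', '2000-03-20', freq='W')
df = pd.DataFrame(index=index, data=data)

# One integer per calendar month (assumed grouping key; rows must be date-sorted).
month_ids = (df.index.year * 12 + df.index.month).to_numpy()

res = pd.DataFrame(
    {col: clamped_cumsum_by_group(df[col].to_numpy(), month_ids, 0, 3)
     for col in df.columns},
    index=df.index,
)
print(res)
```

On the sample data this gives the same values as the desired result in the question.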