Calculate cumulative sum forward pandas
Suppose we have the following dataframe:
Date Type Country Value
0 2016-04-30 A NL 1
1 2016-04-30 A BE 2
2 2016-04-30 B NL 3
3 2016-04-30 B BE 4
4 2016-04-30 C NL 5
5 2016-04-30 C BE 6
6 2016-04-30 C FR 7
7 2016-04-30 C UK 8
8 2016-05-31 A NL 9
9 2016-05-31 A BE 10
10 2016-05-31 A FR 11
11 2016-05-31 B NL 12
12 2016-05-31 B BE 13
13 2016-05-31 B FR 14
14 2016-05-31 C NL 15
15 2016-05-31 C BE 16
16 2016-05-31 C UK 17
17 2016-05-31 C SL 18
18 2016-06-30 A NL 19
19 2016-06-30 B FR 20
20 2016-06-30 B UK 21
21 2016-06-30 B SL 22
22 2016-06-30 C NL 23
23 2016-06-30 C BE 24
24 2016-07-31 A NL 25
25 2016-07-31 A BE 23
26 2016-07-31 B FR 12
27 2016-07-31 B UK 28
28 2016-07-31 B SL 22
29 2016-07-31 C NL 25
30 2016-07-31 C BE 28
This corresponds to the following code:
import pandas as pd

df = pd.DataFrame([
    ['2016-04-30', 'A', 'NL', 1], ['2016-04-30', 'A', 'BE', 2], ['2016-04-30', 'B', 'NL', 3], ['2016-04-30', 'B', 'BE', 4],
    ['2016-04-30', 'C', 'NL', 5], ['2016-04-30', 'C', 'BE', 6], ['2016-04-30', 'C', 'FR', 7], ['2016-04-30', 'C', 'UK', 8],
    ['2016-05-31', 'A', 'NL', 9], ['2016-05-31', 'A', 'BE', 10], ['2016-05-31', 'A', 'FR', 11], ['2016-05-31', 'B', 'NL', 12],
    ['2016-05-31', 'B', 'BE', 13], ['2016-05-31', 'B', 'FR', 14], ['2016-05-31', 'C', 'NL', 15], ['2016-05-31', 'C', 'BE', 16],
    ['2016-05-31', 'C', 'UK', 17], ['2016-05-31', 'C', 'SL', 18], ['2016-06-30', 'A', 'NL', 19], ['2016-06-30', 'B', 'FR', 20],
    ['2016-06-30', 'B', 'UK', 21], ['2016-06-30', 'B', 'SL', 22], ['2016-06-30', 'C', 'NL', 23], ['2016-06-30', 'C', 'BE', 24],
    ['2016-07-31', 'A', 'NL', 25], ['2016-07-31', 'A', 'BE', 23], ['2016-07-31', 'B', 'FR', 12], ['2016-07-31', 'B', 'UK', 28],
    ['2016-07-31', 'B', 'SL', 22], ['2016-07-31', 'C', 'NL', 25], ['2016-07-31', 'C', 'BE', 28],
], columns=['Date', 'Type', 'Country', 'Value'])
I want to create an additional column 'CumValue', which computes the cumulative sum over the next K months (in this case let's say K=3, but I would like it to be general). So for example, for observation [2016-04-30, A, NL], I would want the CumValue to be 1 + 9 + 19 = 29 (so we include the initial month). Suppose for instance that the observation two months ahead is not available, then we set the value equal to NaN.
I would want the end product to look as follows:
Date Type Country Value CumValue
0 2016-04-30 A NL 1 29
1 2016-04-30 A BE 2 NaN
2 2016-04-30 B NL 3 NaN
3 2016-04-30 B BE 4 NaN
4 2016-04-30 C NL 5 43
5 2016-04-30 C BE 6 46
6 2016-04-30 C FR 7 NaN
7 2016-04-30 C UK 8 NaN
8 2016-05-31 A NL 9 53
9 2016-05-31 A BE 10 NaN
10 2016-05-31 A FR 11 NaN
11 2016-05-31 B NL 12 NaN
12 2016-05-31 B BE 13 NaN
13 2016-05-31 B FR 14 46
14 2016-05-31 C NL 15 63
15 2016-05-31 C BE 16 68
16 2016-05-31 C UK 17 NaN
17 2016-05-31 C SL 18 NaN
18 2016-06-30 A NL 19 NaN
19 2016-06-30 B FR 20 NaN
20 2016-06-30 B UK 21 NaN
21 2016-06-30 B SL 22 NaN
22 2016-06-30 C NL 23 NaN
23 2016-06-30 C BE 24 NaN
24 2016-07-31 A NL 25 NaN
25 2016-07-31 A BE 23 NaN
26 2016-07-31 B FR 12 NaN
27 2016-07-31 B UK 28 NaN
28 2016-07-31 B SL 22 NaN
29 2016-07-31 C NL 25 NaN
30 2016-07-31 C BE 28 NaN
Does anyone know an efficient way to do something like this?
You can try the code below. I checked the output for (NL,A), (NL,C), (NL,BE), and it seems to work.
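def shift_cum(x, k=3):
    # Rolling sum over k rows, shifted back k - 1 rows so it sits on the window's first row
    return x.rolling(k).sum().shift(-(k - 1))

# transform keeps the original row index, so the result aligns with df
df.assign(CumValue=df.groupby(['Country','Type'])['Value'].transform(shift_cum))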
Here the function takes k as a parameter with a default of 3, which you can change when applying it. The function first takes the rolling sum within each group and then shifts it back 2 positions (that is, k - 1 positions for k = 3) to match your requirement.
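To see why the shift is needed, here is a minimal sketch on a toy series. It assumes the four (A, NL) values from the question in date order; the variable names `backward` and `forward` are only illustrative.

```python
import pandas as pd

# Values of the (A, NL) group from the question, in date order
s = pd.Series([1, 9, 19, 25])

k = 3
# rolling(k) looks backward: each row sums itself and the k - 1 rows before it
backward = s.rolling(k).sum()
# Shifting by -(k - 1) moves each sum onto the first row of its window
forward = backward.shift(-(k - 1))

print(forward.tolist())  # [29.0, 53.0, nan, nan]
```

The first two entries match the 29 and 53 expected for (A, NL) in the question; the last two rows have no full window ahead of them and stay NaN.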
Yati Raj's solution only works if all month values are continuously available. The OP states:
Suppose for instance that the observation two months ahead is not available, then we set the value equal to NaN
This is the case for Type 'A', Country 'BE': there is no data available for 2016-06-30 and hence the result should be NaN. In order to make it work for this case too, you can modify the solution as follows:
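# asfreq needs a DatetimeIndex, so Date must be converted from string first
df['Date'] = pd.to_datetime(df['Date'])
k = 3
# Fill in missing months per group as NaN rows so the rolling window sees the gaps;
# '1M' is the month-end alias, which newer pandas versions spell 'ME'
cum = (df.set_index('Date')
         .groupby(['Type', 'Country']).Value
         .apply(lambda x: x.asfreq('1M').rolling(k).sum().shift(-(k - 1)))
         .reset_index())
pd.merge(df, cum, on=['Type', 'Country', 'Date']).rename(columns={'Value_x': 'Value', 'Value_y': 'CumValue'})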
This yields the correct result for the second row as given in the OP:
Date Type Country Value CumValue
0 2016-04-30 A NL 1 29.0
1 2016-04-30 A BE 2 NaN
...
(the accepted answer gave a CumValue of 35 here)
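The difference can be reproduced in isolation on the (A, BE) group alone. The following is only a sketch: it uses `reindex` onto a month-end `date_range` as a stand-in for `asfreq` (the two should behave the same on this data), and the names `naive`, `gap_aware` and `full_months` are illustrative.

```python
import pandas as pd

# The (A, BE) observations from the question: June 2016 is missing
s = pd.Series(
    [2, 10, 23],
    index=pd.to_datetime(['2016-04-30', '2016-05-31', '2016-07-31']),
)

k = 3
# Without filling the gap, the window covers April, May and July
naive = s.rolling(k).sum().shift(-(k - 1))

# Reindex onto a complete month-end calendar so that June becomes a NaN row
full_months = pd.date_range(s.index.min(), s.index.max(), freq=pd.offsets.MonthEnd())
gap_aware = s.reindex(full_months).rolling(k).sum().shift(-(k - 1))

print(naive.iloc[0])      # 35.0
print(gap_aware.iloc[0])  # nan
```

Because `rolling(k)` requires k non-missing values by default, any window that touches the inserted June row sums to NaN, which is the behaviour the OP asked for.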