Pandas groupby cumulative sum
I would like to add a cumulative sum column to my Pandas dataframe so that:
name | day | no
---|---|---
Jack | Monday | 10
Jack | Tuesday | 20
Jack | Tuesday | 10
Jack | Wednesday | 50
Jill | Monday | 40
Jill | Wednesday | 110
becomes:
Jack | Monday | 10 | 10
Jack | Tuesday | 30 | 40
Jack | Wednesday | 50 | 90
Jill | Monday | 40 | 40
Jill | Wednesday | 110 | 150
I tried various combinations of df.groupby and df.agg(lambda x: cumsum(x)) to no avail.
This should do it; you need groupby() twice:
df.groupby(['name', 'day']).sum() \
.groupby(level=0).cumsum().reset_index()
Explanation:
print(df)
name day no
0 Jack Monday 10
1 Jack Tuesday 20
2 Jack Tuesday 10
3 Jack Wednesday 50
4 Jill Monday 40
5 Jill Wednesday 110
# sum per name/day
print( df.groupby(['name', 'day']).sum() )
no
name day
Jack Monday 10
Tuesday 30
Wednesday 50
Jill Monday 40
Wednesday 110
# cumulative sum per name/day
print( df.groupby(['name', 'day']).sum() \
.groupby(level=0).cumsum() )
no
name day
Jack Monday 10
Tuesday 40
Wednesday 90
Jill Monday 40
Wednesday 150
The dataframe resulting from the first sum is indexed by 'name' and by 'day'. You can see this by printing
df.groupby(['name', 'day']).sum().index
When computing the cumulative sum, you want to do so by 'name', which corresponds to the first index level (level 0).
Finally, use reset_index to turn the index levels back into columns, so that the names are repeated on every row.
df.groupby(['name', 'day']).sum().groupby(level=0).cumsum().reset_index()
name day no
0 Jack Monday 10
1 Jack Tuesday 40
2 Jack Wednesday 90
3 Jill Monday 40
4 Jill Wednesday 150
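The two steps above can be tried end to end with the sample data. This is a minimal sketch, not the answerer's exact code: it selects the 'no' column explicitly and passes sort=False, which is an addition here. By default groupby sorts the group keys, so day names come out in alphabetical order; that happens to match the weekday order for Monday, Tuesday and Wednesday, but would not for a day such as Friday. With sort=False the groups keep the order in which they first appear.

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Jack", "Jack", "Jack", "Jack", "Jill", "Jill"],
    "day": ["Monday", "Tuesday", "Tuesday", "Wednesday", "Monday", "Wednesday"],
    "no": [10, 20, 10, 50, 40, 110],
})

result = (
    # Step 1: total per (name, day); sort=False keeps first-appearance order
    df.groupby(["name", "day"], sort=False)["no"].sum()
      # Step 2: running total within each name (index level 0)
      .groupby(level=0).cumsum()
      # Step 3: turn the (name, day) index back into ordinary columns
      .reset_index()
)
print(result)
```

The result has one row per name and day, with 'no' holding the running total for that name.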
This works in pandas 0.16.2:
In[23]: print(df)
name day no
0 Jack Monday 10
1 Jack Tuesday 20
2 Jack Tuesday 10
3 Jack Wednesday 50
4 Jill Monday 40
5 Jill Wednesday 110
In[24]: df['no_cumulative'] = df.groupby(['name'])['no'].apply(lambda x: x.cumsum())
In[25]: print(df)
name day no no_cumulative
0 Jack Monday 10 10
1 Jack Tuesday 20 30
2 Jack Tuesday 10 40
3 Jack Wednesday 50 90
4 Jill Monday 40 40
5 Jill Wednesday 110 150
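A note for newer pandas: from version 2.0, groupby(...).apply(...) adds the group keys to the index of the result by default, so assigning that result straight back as a column can fail because the index no longer matches the dataframe. The sketch below passes group_keys=False, which keeps the original row index. The group_keys argument is an addition here and is not part of the answer above.

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Jack", "Jack", "Jack", "Jack", "Jill", "Jill"],
    "day": ["Monday", "Tuesday", "Tuesday", "Wednesday", "Monday", "Wednesday"],
    "no": [10, 20, 10, 50, 40, 110],
})

# group_keys=False keeps the original row labels on the result of apply(),
# so the Series lines up with df and can be stored as a new column
df["no_cumulative"] = (
    df.groupby("name", group_keys=False)["no"].apply(lambda s: s.cumsum())
)
print(df)
```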
Modification to @Dmitry's answer. This is simpler and works in pandas 0.19.0:
print(df)
name day no
0 Jack Monday 10
1 Jack Tuesday 20
2 Jack Tuesday 10
3 Jack Wednesday 50
4 Jill Monday 40
5 Jill Wednesday 110
df['no_csum'] = df.groupby(['name'])['no'].cumsum()
print(df)
name day no no_csum
0 Jack Monday 10 10
1 Jack Tuesday 20 30
2 Jack Tuesday 10 40
3 Jack Wednesday 50 90
4 Jill Monday 40 40
5 Jill Wednesday 110 150
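The grouped cumsum() returns a Series with the same index as the original dataframe, so the rows do not have to be sorted by name first. A small sketch with interleaved rows (the data here is made up for illustration):

```python
import pandas as pd

# Rows for Jack and Jill are deliberately interleaved
df = pd.DataFrame({
    "name": ["Jack", "Jill", "Jack", "Jill", "Jack"],
    "no": [10, 40, 20, 110, 50],
})

# Each row gets the running total of its own group, in original row order
df["no_csum"] = df.groupby("name")["no"].cumsum()
print(df)
```

Jack's rows accumulate to 10, 30, 80 and Jill's to 40, 150, each placed on the row it belongs to.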
You can use
df['cum_no'] = df.no.cumsum()
Note that this accumulates over the whole column rather than restarting for each name.
http://pandas.pydata.org/pandas-docs/version/0.19.2/generated/pandas.DataFrame.cumsum.html
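To make the difference concrete, here is a short comparison on the sample data of a plain column cumsum() and a grouped one. The column names cum_no and cum_no_by_name are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Jack", "Jack", "Jack", "Jack", "Jill", "Jill"],
    "no": [10, 20, 10, 50, 40, 110],
})

# Running total over the whole column: keeps growing across names
df["cum_no"] = df["no"].cumsum()

# Running total per name: starts again at the first row of each name
df["cum_no_by_name"] = df.groupby("name")["no"].cumsum()
print(df)
```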
Another way of doing it:
import pandas as pd
df = pd.DataFrame({'C1' : ['a','a','a','b','b'],
'C2' : [1,2,3,4,5]})
df['cumsum'] = df.groupby(by=['C1'])['C2'].transform(lambda x: x.cumsum())
df
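For reference, a sketch of what this produces on the small C1/C2 frame, alongside the string form transform('cumsum'), which should give the same values without a lambda. The column names cumsum_lambda and cumsum_alias are made up here.

```python
import pandas as pd

df = pd.DataFrame({"C1": ["a", "a", "a", "b", "b"],
                   "C2": [1, 2, 3, 4, 5]})

# transform() returns a result aligned with the original rows
df["cumsum_lambda"] = df.groupby("C1")["C2"].transform(lambda x: x.cumsum())

# Same computation, naming the method instead of wrapping it in a lambda
df["cumsum_alias"] = df.groupby("C1")["C2"].transform("cumsum")
print(df)
```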
Instead of df.groupby(by=['name','day']).sum().groupby(level=[0]).cumsum() (see above) you could also do df.set_index(['name', 'day']).groupby(level=0, as_index=False).cumsum()
df.groupby(by=['name','day']).sum() mainly moves both columns into a MultiIndex; it also adds up rows that share the same name and day, which set_index does not do
as_index=False means you do not need to call reset_index afterwards
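A quick way to compare the two approaches is to run both on the sample data and look at the row counts. This sketch leaves out as_index and calls reset_index() explicitly at the end, so it does not depend on how as_index interacts with cumsum().

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Jack", "Jack", "Jack", "Jack", "Jill", "Jill"],
    "day": ["Monday", "Tuesday", "Tuesday", "Wednesday", "Monday", "Wednesday"],
    "no": [10, 20, 10, 50, 40, 110],
})

# sum() first: the two Jack/Tuesday rows are merged before accumulating
via_sum = df.groupby(["name", "day"]).sum().groupby(level=0).cumsum()

# set_index only: every original row is kept and accumulated
via_index = df.set_index(["name", "day"]).groupby(level=0).cumsum()

print(via_sum.reset_index())    # one row per (name, day)
print(via_index.reset_index())  # one row per original row
```

Which one is appropriate depends on whether duplicate name/day rows should be collapsed into a single row.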
data.csv:
name,day,no
Jack,Monday,10
Jack,Tuesday,20
Jack,Tuesday,10
Jack,Wednesday,50
Jill,Monday,40
Jill,Wednesday,110
Code:
import pandas as pd
# load the sample rows from data.csv
df = pd.read_csv('data.csv')
print(df)
# total per name and day
df = df.groupby(['name', 'day'])['no'].sum().reset_index()
print(df)
# running total within each name
df['cumsum'] = df.groupby(['name'])['no'].cumsum()
print(df)
Output:
name day no
0 Jack Monday 10
1 Jack Tuesday 20
2 Jack Tuesday 10
3 Jack Wednesday 50
4 Jill Monday 40
5 Jill Wednesday 110
name day no
0 Jack Monday 10
1 Jack Tuesday 30
2 Jack Wednesday 50
3 Jill Monday 40
4 Jill Wednesday 110
name day no cumsum
0 Jack Monday 10 10
1 Jack Tuesday 30 40
2 Jack Wednesday 50 90
3 Jill Monday 40 40
4 Jill Wednesday 110 150
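To try this without creating a file, the CSV text can be read from an in-memory buffer. This is a sketch under that assumption; io.StringIO stands in for data.csv, and the variable names csv_text and daily are made up here.

```python
import io

import pandas as pd

csv_text = """name,day,no
Jack,Monday,10
Jack,Tuesday,20
Jack,Tuesday,10
Jack,Wednesday,50
Jill,Monday,40
Jill,Wednesday,110
"""

# read_csv accepts any file-like object, so a StringIO buffer works here
df = pd.read_csv(io.StringIO(csv_text))

# total per name and day, keeping name and day as ordinary columns
daily = df.groupby(["name", "day"], as_index=False)["no"].sum()

# running total within each name
daily["cumsum"] = daily.groupby("name")["no"].cumsum()
print(daily)
```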
As of version 1.0, pandas has a new API for window functions.
Specifically, what was achieved earlier with
df.groupby(['name'])['no'].apply(lambda x: x.cumsum())
or
df.set_index(['name', 'day']).groupby(level=0, as_index=False).cumsum()
now becomes
df.groupby(['name'])['no'].expanding().sum()
I find it more intuitive for all window-related functions than groupby+level operations, although learning to use groupby is useful for general purposes.
See the docs: https://pandas.pydata.org/docs/user_guide/window.html
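One practical detail when using expanding() this way: the result is indexed by the group key plus the original row label, and the values come back as floats. The sketch below drops the group level so the result lines up with the original rows again; the droplevel step and the no_csum column name are additions for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Jack", "Jack", "Jack", "Jack", "Jill", "Jill"],
    "day": ["Monday", "Tuesday", "Tuesday", "Wednesday", "Monday", "Wednesday"],
    "no": [10, 20, 10, 50, 40, 110],
})

# Expanding window per name: each row sums everything seen so far in its group
expanding_total = df.groupby("name")["no"].expanding().sum()
print(expanding_total)  # MultiIndex of (name, original row label)

# Drop the 'name' level so the index matches df, then store as a column
df["no_csum"] = expanding_total.droplevel(0)
print(df)
```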
If you want to write a one-liner (perhaps you want to pass the methods into a pipeline), you can do so by first setting the as_index parameter of the groupby method to False, so that the aggregation step returns a dataframe, and then using assign() to add a new column to it (the cumulative sum for each person).
These chained methods return a new dataframe, so you'll need to assign it to a variable (e.g. agg_df) to be able to use it later on.
agg_df = (
# aggregate df by name and day
df.groupby(['name','day'], as_index=False)['no'].sum()
.assign(
# assign the cumulative sum of each name as a new column
cumulative_sum=lambda x: x.groupby('name')['no'].cumsum()
)
)
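If the same step is needed in several pipelines, one option is to wrap it in a small function and call it through pipe(). The helper below, add_group_cumsum, is hypothetical and not part of pandas; it is only a sketch of how the chain above could be made reusable.

```python
import pandas as pd


def add_group_cumsum(frame, group_col, value_col, new_col):
    """Hypothetical helper: return a copy of frame with a per-group running total."""
    running_total = frame.groupby(group_col)[value_col].cumsum()
    return frame.assign(**{new_col: running_total})


df = pd.DataFrame({
    "name": ["Jack", "Jack", "Jack", "Jack", "Jill", "Jill"],
    "day": ["Monday", "Tuesday", "Tuesday", "Wednesday", "Monday", "Wednesday"],
    "no": [10, 20, 10, 50, 40, 110],
})

agg_df = (
    # aggregate df by name and day
    df.groupby(["name", "day"], as_index=False)["no"].sum()
      # add the running total per name as a new column
      .pipe(add_group_cumsum, "name", "no", "cumulative_sum")
)
print(agg_df)
```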
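Notice: the technical posts on this site are published under the CC BY-SA 4.0 license. If you repost them, please credit this site's URL or the original address. For any questions, contact: yoyou2525@163.com.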