
Pandas groupby cumulative sum

I would like to add a cumulative sum column to my Pandas dataframe so that:

name | day        | no
Jack | Monday     | 10
Jack | Tuesday    | 20
Jack | Tuesday    | 10
Jack | Wednesday  | 50
Jill | Monday     | 40
Jill | Wednesday  | 110

becomes:

Jack | Monday     | 10  | 10
Jack | Tuesday    | 30  | 40
Jack | Wednesday  | 50  | 90
Jill | Monday     | 40  | 40
Jill | Wednesday  | 110 | 150

I tried various combos of df.groupby and df.agg(lambda x: cumsum(x)) to no avail.

This should do it; you need groupby() twice:

df.groupby(['name', 'day']).sum() \
  .groupby(level=0).cumsum().reset_index()

Explanation:

print(df)
   name        day   no
0  Jack     Monday   10
1  Jack    Tuesday   20
2  Jack    Tuesday   10
3  Jack  Wednesday   50
4  Jill     Monday   40
5  Jill  Wednesday  110

# sum per name/day
print( df.groupby(['name', 'day']).sum() )
                 no
name day           
Jack Monday      10
     Tuesday     30
     Wednesday   50
Jill Monday      40
     Wednesday  110

# cumulative sum per name/day
print( df.groupby(['name', 'day']).sum() \
         .groupby(level=0).cumsum() )
                 no
name day           
Jack Monday      10
     Tuesday     40
     Wednesday   90
Jill Monday      40
     Wednesday  150

The dataframe resulting from the first sum is indexed by 'name' and by 'day'. You can see this by printing

df.groupby(['name', 'day']).sum().index 
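
As a small self-contained sketch of that check (the frame is rebuilt from the question's data, and the variable names daily, index_names and outer_level are only illustrative), the index of the summed frame can be inspected like this:

```python
import pandas as pd

# Rebuild the question's example frame.
df = pd.DataFrame({
    "name": ["Jack", "Jack", "Jack", "Jack", "Jill", "Jill"],
    "day": ["Monday", "Tuesday", "Tuesday", "Wednesday", "Monday", "Wednesday"],
    "no": [10, 20, 10, 50, 40, 110],
})

# Summing per (name, day) yields a frame indexed by a two-level MultiIndex.
daily = df.groupby(["name", "day"]).sum()

# Level names, outermost first, and the distinct values of level 0.
index_names = list(daily.index.names)
outer_level = daily.index.get_level_values(0).unique().tolist()

print(index_names)
print(outer_level)
```

Level 0 is the 'name' level, which is why the second groupby can refer to it by position.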

When computing the cumulative sum, you want to do so by 'name', which corresponds to the first index level (level 0).

Finally, use reset_index to have the names repeated.

df.groupby(['name', 'day']).sum().groupby(level=0).cumsum().reset_index()

   name        day   no
0  Jack     Monday   10
1  Jack    Tuesday   40
2  Jack  Wednesday   90
3  Jill     Monday   40
4  Jill  Wednesday  150
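
One thing to watch: groupby sorts its keys by default, so the days come out in alphabetical order. That happens to match Monday, Tuesday, Wednesday, but it would not for, say, Friday. A minimal sketch of the same two-step approach, assuming the rows are already in chronological order and using sort=False (my addition, not part of the answer above) to keep that order:

```python
import pandas as pd

# Rebuild the question's example frame (rows already in chronological order).
df = pd.DataFrame({
    "name": ["Jack", "Jack", "Jack", "Jack", "Jill", "Jill"],
    "day": ["Monday", "Tuesday", "Tuesday", "Wednesday", "Monday", "Wednesday"],
    "no": [10, 20, 10, 50, 40, 110],
})

# Step 1: one row per (name, day); sort=False keeps groups in order of first appearance.
daily = df.groupby(["name", "day"], sort=False).sum()

# Step 2: running total within each name (index level 0), then flatten the index.
result = daily.groupby(level=0).cumsum().reset_index()

print(result)
```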

This works in pandas 0.16.2

In[23]: print(df)
        name          day   no
0      Jack       Monday    10
1      Jack      Tuesday    20
2      Jack      Tuesday    10
3      Jack    Wednesday    50
4      Jill       Monday    40
5      Jill    Wednesday   110
In[24]: df['no_cumulative'] = df.groupby(['name'])['no'].apply(lambda x: x.cumsum())
In[25]: print(df)
        name          day   no  no_cumulative
0      Jack       Monday    10             10
1      Jack      Tuesday    20             30
2      Jack      Tuesday    10             40
3      Jack    Wednesday    50             90
4      Jill       Monday    40             40
5      Jill    Wednesday   110            150
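
On newer pandas releases the same apply call may return a result whose index carries the group key as an extra level, in which case the column assignment no longer lines up with the frame. A hedged sketch of one way around that, passing group_keys=False to groupby (the frame is rebuilt from the data above):

```python
import pandas as pd

# Rebuild the example frame.
df = pd.DataFrame({
    "name": ["Jack", "Jack", "Jack", "Jack", "Jill", "Jill"],
    "day": ["Monday", "Tuesday", "Tuesday", "Wednesday", "Monday", "Wednesday"],
    "no": [10, 20, 10, 50, 40, 110],
})

# group_keys=False asks apply not to prepend the group key to the result index,
# so the result keeps the original row labels and aligns with df on assignment.
df["no_cumulative"] = (
    df.groupby("name", group_keys=False)["no"].apply(lambda x: x.cumsum())
)

print(df)
```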

Modification to @Dmitry's answer. This is simpler and works in pandas 0.19.0:

print(df)
   name        day   no
0  Jack     Monday   10
1  Jack    Tuesday   20
2  Jack    Tuesday   10
3  Jack  Wednesday   50
4  Jill     Monday   40
5  Jill  Wednesday  110

df['no_csum'] = df.groupby(['name'])['no'].cumsum()

print(df)
   name        day   no  no_csum
0  Jack     Monday   10       10
1  Jack    Tuesday   20       30
2  Jack    Tuesday   10       40
3  Jack  Wednesday   50       90
4  Jill     Monday   40       40
5  Jill  Wednesday  110      150

You should use

df['cum_no'] = df.no.cumsum()

http://pandas.pydata.org/pandas-docs/version/0.19.2/generated/pandas.DataFrame.cumsum.html
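
Note that a plain cumsum on the column accumulates over every row and does not restart per name. A short sketch on the question's data comparing it with the grouped version (the column name cum_no_by_name is just an illustrative choice):

```python
import pandas as pd

# Rebuild the question's example frame.
df = pd.DataFrame({
    "name": ["Jack", "Jack", "Jack", "Jack", "Jill", "Jill"],
    "day": ["Monday", "Tuesday", "Tuesday", "Wednesday", "Monday", "Wednesday"],
    "no": [10, 20, 10, 50, 40, 110],
})

# Running total over the whole column: Jill's rows continue from Jack's total.
df["cum_no"] = df["no"].cumsum()

# Running total that restarts for each name.
df["cum_no_by_name"] = df.groupby("name")["no"].cumsum()

print(df)
```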

Another way of doing it

import pandas as pd
df = pd.DataFrame({'C1' : ['a','a','a','b','b'],
           'C2' : [1,2,3,4,5]})
df['cumsum'] = df.groupby(by=['C1'])['C2'].transform(lambda x: x.cumsum())
df
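
For reference, a sketch of the same idea that passes the method name as a string to transform instead of a lambda (an equivalent spelling as far as I know, not something the answer itself uses):

```python
import pandas as pd

df = pd.DataFrame({
    "C1": ["a", "a", "a", "b", "b"],
    "C2": [1, 2, 3, 4, 5],
})

# transform returns a result aligned with the original rows,
# so it can be assigned straight back as a new column.
df["cumsum"] = df.groupby("C1")["C2"].transform("cumsum")

print(df["cumsum"].tolist())  # [1, 3, 6, 4, 9]
```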


Instead of df.groupby(by=['name','day']).sum().groupby(level=[0]).cumsum() (see above) you could also do df.set_index(['name', 'day']).groupby(level=0, as_index=False).cumsum()

  • df.groupby(by=['name','day']).sum() is effectively just moving both columns to a MultiIndex when each (name, day) pair occurs once; duplicate pairs, such as Jack's two Tuesday rows, are summed by it but stay as separate rows with set_index
  • as_index=False means you do not need to call reset_index afterwards
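
The two variants only agree when every (name, day) pair is unique. A small sketch on the question's data, which has two Tuesday rows for Jack, to show the difference (as_index=False is left out here to keep the comparison simple; the variable names are illustrative):

```python
import pandas as pd

# Rebuild the question's example frame.
df = pd.DataFrame({
    "name": ["Jack", "Jack", "Jack", "Jack", "Jill", "Jill"],
    "day": ["Monday", "Tuesday", "Tuesday", "Wednesday", "Monday", "Wednesday"],
    "no": [10, 20, 10, 50, 40, 110],
})

# Variant 1: only move the columns into the index, then accumulate per name.
# The duplicate (Jack, Tuesday) rows are kept as two rows.
by_index = df.set_index(["name", "day"]).groupby(level="name").cumsum()

# Variant 2: aggregate per (name, day) first, then accumulate per name.
aggregated = df.groupby(["name", "day"]).sum().groupby(level="name").cumsum()

print(len(by_index), len(aggregated))  # 6 5
```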

data.csv:

name,day,no
Jack,Monday,10
Jack,Tuesday,20
Jack,Tuesday,10
Jack,Wednesday,50
Jill,Monday,40
Jill,Wednesday,110

Code:

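import pandas as pd

df = pd.read_csv('data.csv')
print(df)
# sum 'no' per name/day so that each day appears once per name
df = df.groupby(['name', 'day'])['no'].sum().reset_index()
print(df)
# running total within each name; group_keys=False keeps the original row index
df['cumsum'] = df.groupby(['name'], group_keys=False)['no'].apply(lambda x: x.cumsum())
print(df)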

Output:

   name        day   no
0  Jack     Monday   10
1  Jack    Tuesday   20
2  Jack    Tuesday   10
3  Jack  Wednesday   50
4  Jill     Monday   40
5  Jill  Wednesday  110
   name        day   no
0  Jack     Monday   10
1  Jack    Tuesday   30
2  Jack  Wednesday   50
3  Jill     Monday   40
4  Jill  Wednesday  110
   name        day   no  cumsum
0  Jack     Monday   10      10
1  Jack    Tuesday   30      40
2  Jack  Wednesday   50      90
3  Jill     Monday   40      40
4  Jill  Wednesday  110     150

As of version 1.0, pandas has a new API for window functions.

Specifically, what was achieved earlier with
df.groupby(['name'])['no'].apply(lambda x: x.cumsum())
or
df.set_index(['name', 'day']).groupby(level=0, as_index=False).cumsum()

now becomes
df.groupby(['name'])['no'].expanding().sum()

I find it more intuitive for all window-related functions than groupby+level operations, although learning to use groupby is useful for general purposes.
See the docs: https://pandas.pydata.org/docs/user_guide/window.html
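
A hedged sketch of how the expanding version might be written back as a column. The result of expanding().sum() on a groupby is indexed by the group key plus the original row label, so the group level is dropped here before assignment (that step, and the names windowed and no_csum, are my assumptions about how you would want to use it):

```python
import pandas as pd

# Rebuild the question's example frame.
df = pd.DataFrame({
    "name": ["Jack", "Jack", "Jack", "Jack", "Jill", "Jill"],
    "day": ["Monday", "Tuesday", "Tuesday", "Wednesday", "Monday", "Wednesday"],
    "no": [10, 20, 10, 50, 40, 110],
})

# Expanding window per name: each row sums everything seen so far in its group.
windowed = df.groupby("name")["no"].expanding().sum()

# Drop the 'name' index level so the values align with df's original row labels.
df["no_csum"] = windowed.reset_index(level=0, drop=True)

print(df)
```

The window result comes back as floats, unlike groupby(...).cumsum(), which keeps integers here.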

If you want to write a one-liner (perhaps you want to pass the methods into a pipeline), you can do so by first setting the as_index parameter of the groupby method to False, so that the aggregation step returns a dataframe, and then using assign() to add a new column to it (the cumulative sum for each person).

These chained methods return a new dataframe, so you'll need to assign it to a variable (e.g. agg_df) to be able to use it later on.

agg_df = (
    # aggregate df by name and day
    df.groupby(['name','day'], as_index=False)['no'].sum()
    .assign(
        # assign the cumulative sum of each name as a new column
        cumulative_sum=lambda x: x.groupby('name')['no'].cumsum()
    )
)
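
If the chain is meant to go into a pipeline, one option is to wrap it in a function and call it through DataFrame.pipe. A minimal sketch, where add_cumulative_sum is a hypothetical helper name and the frame is rebuilt from the question's data:

```python
import pandas as pd


def add_cumulative_sum(frame: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: sum 'no' per name/day, then add a per-name running total."""
    return (
        frame.groupby(["name", "day"], as_index=False)["no"].sum()
        .assign(cumulative_sum=lambda x: x.groupby("name")["no"].cumsum())
    )


df = pd.DataFrame({
    "name": ["Jack", "Jack", "Jack", "Jack", "Jill", "Jill"],
    "day": ["Monday", "Tuesday", "Tuesday", "Wednesday", "Monday", "Wednesday"],
    "no": [10, 20, 10, 50, 40, 110],
})

# pipe passes df as the first argument, so the step slots into a longer method chain.
agg_df = df.pipe(add_cumulative_sum)

print(agg_df)
```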
