Groupby sum vs cumsum - returns dataframe vs series

I noticed that when we group a DataFrame and sum a column, the result comes back with the full group structure, indexed by the group keys:

import numpy as np
import pandas as pd

dict1 = {'A': {0: 'A0', 1: 'A0', 2: 'A0', 3: 'A0', 4: 'A1', 5: 'A1', 6: 'A1', 7: 'A1', 8: 'A1', 9: 'A1'}, 'B': {0: 'B0', 1: 'B1', 2: 'B2', 3: 'B3', 4: 'B4', 5: 'B5', 6: 'B6', 7: 'B7', 8: 'B8', 9: 'B9'}, 'C': {0: 0, 1: 1, 2: 2, 3: 3, 4: 4, 5: 5, 6: 6, 7: 7, 8: 8, 9: 9}, 'D': {0: 10, 1: 11, 2: 12, 3: 13, 4: 14, 5: 15, 6: 16, 7: 17, 8: 18, 9: 19}, 'E': {0: 'E0', 1: 'E1', 2: 'E0', 3: 'E1', 4: 'E0', 5: 'E1', 6: 'E0', 7: 'E1', 8: 'E0', 9: 'E1'}}

df2 = pd.DataFrame(dict1)
df2.groupby(['A','E'])['D'].sum()    # sum of D per (A, E) group
A   E 
A0  E0    22
    E1    24
A1  E0    48
    E1    51
Name: D, dtype: int64

But when I do a cumsum, it returns only the cumulative series, indexed like the original rows. Why do the two behave differently? And how can I get the cumsum returned together with the grouped DataFrame, without assigning it back?

df2.groupby(['A','E'])['D'].cumsum()
0    10
1    11
2    22
3    24
4    14
5    15
6    30
7    32
8    48
9    51
Name: D, dtype: int64

Edit: I thought this would be an easy fix and that I could handle the rest myself. But the comments so far lead me away from my end goal. Ultimately, I want to compute sum, mean, and cumsum on multiple columns in a single groupby, like this:

df2.groupby(['A','E']).agg({'D':'cumsum','C':lambda x: 4*np.sum(x)})

But it gives an output like this:

             D     C
0         10.0   NaN
1         11.0   NaN
2         22.0   NaN
3         24.0   NaN
4         14.0   NaN
5         15.0   NaN
6         30.0   NaN
7         32.0   NaN
8         48.0   NaN
9         51.0   NaN
(A0, E0)   NaN   8.0
(A0, E1)   NaN  16.0
(A1, E0)   NaN  72.0
(A1, E1)   NaN  84.0

So is there no way to achieve this without handling cumsum separately?

You can understand this from the behavior you already see in your two scripts. pd.Series.cumsum() returns, for each group, another series with the same length as that group's rows of column D, while your lambda function returns a single value per group. The two results therefore carry different indexes.

All you have to do is wrap the complete cumsum operation in another lambda function at the level of each group. This lambda returns a list object, so it acts as an aggregation (one object per group) instead of a transformation that outputs a series.

t = {
    'D': lambda x: list(x.cumsum()),
    'C': lambda x: 4*np.sum(x)
    }

result = df2.groupby(['A','E']).agg(t)
result
                  D   C
A  E                   
A0 E0      [10, 22]   8
   E1      [11, 24]  16
A1 E0  [14, 30, 48]  72
   E1  [15, 32, 51]  84

This returns a DataFrame at the level of the groups formed by columns A and E.

However, if you want one row per original row (the same number of rows as the original, ordered by group rather than by the original index), you can explode the new column D:

t = {
    'D': lambda x: list(x.cumsum()),
    'C': lambda x: 4*np.sum(x)
    }

result = df2.groupby(['A','E']).agg(t).explode('D')
result
        D   C
A  E         
A0 E0  10   8
   E0  22   8
   E1  11  16
   E1  24  16
A1 E0  14  72
   E0  30  72
   E0  48  72
   E1  15  84
   E1  32  84
   E1  51  84
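One follow-up that may be useful after the explode step: as far as I know, explode leaves the exploded column with object dtype, so a cast is usually needed before doing arithmetic on D. The snippet below is a minimal sketch under that assumption. It rebuilds only the columns it needs (the names df and flat are mine, not from the question) and also shows reset_index in case the group keys should become ordinary columns.

```python
import numpy as np
import pandas as pd

# Same values as df2 in the question, limited to the columns used here.
df = pd.DataFrame({
    'A': ['A0'] * 4 + ['A1'] * 6,
    'E': ['E0', 'E1'] * 5,
    'C': list(range(10)),
    'D': list(range(10, 20)),
})

result = (
    df.groupby(['A', 'E'])
      .agg({'D': lambda x: list(x.cumsum()), 'C': lambda x: 4 * np.sum(x)})
      .explode('D')
)

# Cast the exploded column back to integers (assumed to be object dtype here).
result['D'] = result['D'].astype('int64')

# Optionally turn the group keys A and E back into ordinary columns.
flat = result.reset_index()
print(flat)
```

Note that the rows come out ordered by group, so the original row labels 0..9 are not recovered this way.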

EDIT 1: Additional info based on my comments

  • Simply put, sum is an aggregation and returns a single value (float/int) for each group, while cumsum is a transformation and returns a series with the same number of rows as the input.

  • cumsum basically transforms the given input series (rows for column D for each group) and returns another series.

  • sum returns a series indexed by group, as in the first script in my answer, while cumsum returns a series with one entry per input row, as in the second script. When pandas tries to reconcile them, it stacks the indexes because they don't match.

  • For example, for the group (A1, E0), cumsum on D returns a series with 3 values [14, 30, 48], while the aggregation 4*np.sum on C returns the single value 72.
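The points above can be checked directly by comparing the indexes of the two results. This is a small sketch on the same D values as the question; the variable names (df, grouped, agg_result, trans_result) are only for illustration.

```python
import pandas as pd

# Same values as df2 in the question, limited to the columns used here.
df = pd.DataFrame({
    'A': ['A0'] * 4 + ['A1'] * 6,
    'E': ['E0', 'E1'] * 5,
    'D': list(range(10, 20)),
})

grouped = df.groupby(['A', 'E'])['D']

agg_result = grouped.sum()       # aggregation: one value per group
trans_result = grouped.cumsum()  # transformation: one value per input row

print(list(agg_result.index.names))         # the group keys become the index
print(len(agg_result), len(trans_result))   # number of groups vs number of rows
print(trans_result.index.equals(df.index))  # cumsum keeps the original row index
```

Because one result is indexed by (A, E) and the other by the original row labels, there is no common index on which a single agg call could line them up.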

EDIT 2: Code with transform on groupby as per your comments

If you want to avoid lambda functions, as I understand from your comments, you can use the transform method on a groupby object. However, transform does not accept a dict that maps different columns to different transformations in one call, so you will still have to assign each new column separately.

grouper = df2.groupby(['A','E'])                #<- create grouper

df2['C_new'] = grouper['C'].transform('sum')    #<- use your lambda function here if you need
df2['D_new'] = grouper['D'].transform('cumsum') #<- transformation here
print(df2)
    A   B  C   D   E  C_new  D_new
0  A0  B0  0  10  E0      2     10
1  A0  B1  1  11  E1      4     11
2  A0  B2  2  12  E0      2     22
3  A0  B3  3  13  E1      4     24
4  A1  B4  4  14  E0     18     14
5  A1  B5  5  15  E1     21     15
6  A1  B6  6  16  E0     18     30
7  A1  B7  7  17  E1     21     32
8  A1  B8  8  18  E0     18     48
9  A1  B9  9  19  E1     21     51
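If the goal is mainly to avoid mutating df2, one option is DataFrame.assign, which returns a new frame and leaves the original untouched. This is a sketch rather than the only way to do it: it uses a lambda inside transform to reproduce the 4*sum from the question (a scalar returned by the function is broadcast to every row of its group), and the names df and out are illustrative.

```python
import pandas as pd

# Same values as df2 in the question, limited to the columns used here.
df = pd.DataFrame({
    'A': ['A0'] * 4 + ['A1'] * 6,
    'E': ['E0', 'E1'] * 5,
    'C': list(range(10)),
    'D': list(range(10, 20)),
})

grouper = df.groupby(['A', 'E'])

out = df.assign(
    # Per-group aggregate, broadcast back to every row of the group.
    C_new=grouper['C'].transform(lambda x: 4 * x.sum()),
    # Per-group running total, already one value per row.
    D_new=grouper['D'].transform('cumsum'),
)
print(out)
```

Both new columns share the original row index, which is why they can sit side by side here while the mixed agg call could not align them.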

