
Why do groupby operations behave differently

When using pandas groupby functions and manipulating the output afterwards, I've noticed that some functions behave differently in what they return and in how the resulting index can be manipulated.

Say we have a dataframe with the following information:

    Name   Type  ID
0  Book1  ebook   1
1  Book2  paper   2
2  Book3  paper   3
3  Book1  ebook   1
4  Book2  paper   2

If we do

df.groupby(["Name", "Type"]).sum()  

we get a DataFrame :

             ID
Name  Type     
Book1 ebook   2
Book2 paper   4
Book3 paper   3

which contains a MultiIndex with the columns used in the groupby:

MultiIndex([('Book1', 'ebook'),
            ('Book2', 'paper'),
            ('Book3', 'paper')],
           names=['Name', 'Type'])

and one column called ID.

But if I apply size(), the result is a Series:

Name   Type 
Book1  ebook    2
Book2  paper    2
Book3  paper    1
dtype: int64

Finally, if I do pct_change(), we get a DataFrame with only the resulting column, indexed like the original DataFrame:

    ID
0   NaN
1   NaN
2   NaN
3   0.0
4   0.0

TL;DR: I want to know why some functions return a Series while others return a DataFrame, as this confused me when performing different operations on the same DataFrame.

From the documentation for size:

 Returns: Series. Number of rows in each group.

For sum, since you did not select a column before aggregating, the result is a DataFrame rather than a Series. If you select the column first, you get a Series instead:

df.groupby(["Name", "Type"])['ID'].sum()  # return Series

Functions like diff and pct_change are not aggregations: they return values with the same index as the original DataFrame. count, mean, and sum are aggregations, so they return values indexed by the groupby keys.
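This split can be verified directly; here's a minimal sketch (rebuilding the question's frame) comparing the two kinds of index:

```python
import pandas as pd

# Rebuild the frame from the question
df = pd.DataFrame({
    "Name": ["Book1", "Book2", "Book3", "Book1", "Book2"],
    "Type": ["ebook", "paper", "paper", "ebook", "paper"],
    "ID":   [1, 2, 3, 1, 2],
})

gp = df.groupby(["Name", "Type"])

# Aggregations collapse each group: the index is the unique group keys
agg = gp.sum()
print(type(agg).__name__)           # DataFrame
print(list(agg.index.names))        # ['Name', 'Type']

# Non-aggregating functions keep the original row index
pct = gp.pct_change()
print(pct.index.equals(df.index))   # True
```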

The outputs are different because the operations themselves are different, and the operation is what mostly controls what is returned. Think of the array equivalent: the data are the same, but one "aggregation" returns a single scalar value while the other returns an array the same size as the input.

import numpy as np
np.array([1,2,3]).sum()
#6

np.array([1,2,3]).cumsum()
#array([1, 3, 6], dtype=int32)

The same thing goes for operations on a DataFrameGroupBy object. All the groupby itself does is create a mapping from the DataFrame to the groups. Since this doesn't compute anything on its own, there's no reason the same groupby with a different operation needs to return the same type of output (see above).

gp = df.groupby(["Name", "Type"])
# Haven't done any aggregations yet...

The other important part here is that we have a DataFrame GroupBy object. There are also Series GroupBy objects, and that difference can change the return.

gp
#<pandas.core.groupby.generic.DataFrameGroupBy object>
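Selecting a single column from a DataFrameGroupBy gives you its SeriesGroupBy counterpart; a quick check (rebuilding the question's frame):

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Book1", "Book2", "Book3", "Book1", "Book2"],
    "Type": ["ebook", "paper", "paper", "ebook", "paper"],
    "ID":   [1, 2, 3, 1, 2],
})

gp = df.groupby(["Name", "Type"])

# Grouping a DataFrame yields a DataFrameGroupBy;
# selecting one column from it yields a SeriesGroupBy
print(type(gp).__name__)        # DataFrameGroupBy
print(type(gp["ID"]).__name__)  # SeriesGroupBy
```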

So what happens when you aggregate?

With a DataFrameGroupBy, when you choose an aggregation (like sum) that collapses to a single value per group, the return will be a DataFrame whose index is the unique grouping keys. The return is a DataFrame because we provided a DataFrameGroupBy object: DataFrames can have multiple columns, and had there been another numeric column it would have been aggregated too, necessitating a DataFrame output.

gp.sum()
#             ID
#Name  Type     
#Book1 ebook   2
#Book2 paper   4
#Book3 paper   3

On the other hand, if you use a SeriesGroupBy object (select a single column with []), you'll get a Series back, again indexed by the unique group keys.

df.groupby(["Name", "Type"])['ID'].sum()
|------- SeriesGroupBy ----------|

#Name   Type 
#Book1  ebook    2
#Book2  paper    4
#Book3  paper    3
#Name: ID, dtype: int64

For functions that return arrays (like cumsum or pct_change), a DataFrameGroupBy will return a DataFrame and a SeriesGroupBy will return a Series, but the index is no longer the unique group keys. That is because indexing by group keys would make little sense here: typically you want to do a calculation within each group and then assign the result back to the original DataFrame. The return is therefore indexed like the original DataFrame, which makes creating these columns very simple since pandas handles all of the alignment:

df['ID_pct_change'] = gp.pct_change()

#    Name   Type  ID  ID_pct_change
#0  Book1  ebook   1            NaN  
#1  Book2  paper   2            NaN   
#2  Book3  paper   3            NaN   
#3  Book1  ebook   1            0.0  # Calculated from row 0 and aligned.
#4  Book2  paper   2            0.0

But what about size? That one is a bit weird. The size of a group is a scalar, and it doesn't matter how many columns the group has or whether values in those columns are missing, so whether you call it on a DataFrameGroupBy or a SeriesGroupBy object is irrelevant. As a result pandas will always return a Series. And since it is a group-level aggregation that returns a scalar, it makes sense for the return to be indexed by the unique group keys.

gp.size()
#Name   Type 
#Book1  ebook    2
#Book2  paper    2
#Book3  paper    1
#dtype: int64
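As an aside, if you would rather have those counts as a DataFrame with the keys as ordinary columns, reset_index works on the Series (the column name "count" below is an arbitrary choice):

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Book1", "Book2", "Book3", "Book1", "Book2"],
    "Type": ["ebook", "paper", "paper", "ebook", "paper"],
    "ID":   [1, 2, 3, 1, 2],
})

# size() always yields a Series; reset_index turns it into a
# DataFrame with the group keys as regular columns
counts = df.groupby(["Name", "Type"]).size().reset_index(name="count")
print(counts)
#     Name   Type  count
# 0  Book1  ebook      2
# 1  Book2  paper      2
# 2  Book3  paper      1
```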

Finally, for completeness: though aggregations like sum return a single scalar per group, it can often be useful to bring those values back to every row of that group in the original DataFrame. However, the return of a normal .sum() has a different index, so it won't align. You could merge the values back on the unique keys, but pandas provides transform for exactly these aggregations. Since the intent here is to bring the result back to the original DataFrame, the Series/DataFrame is indexed like the original input:

gp.transform('sum')
#   ID
#0   2    # Row 0 is Book1 ebook which has a group sum of 2
#1   4
#2   3
#3   2    # Row 3 is also Book1 ebook which has a group sum of 2
#4   4
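transform works the same way on a SeriesGroupBy, returning a Series aligned to the original index, so a per-group total can be assigned in one line (the column name "group_total" is just illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Book1", "Book2", "Book3", "Book1", "Book2"],
    "Type": ["ebook", "paper", "paper", "ebook", "paper"],
    "ID":   [1, 2, 3, 1, 2],
})

# transform on a SeriesGroupBy returns a Series indexed like df,
# so assignment aligns row-by-row with no merge needed
df["group_total"] = df.groupby(["Name", "Type"])["ID"].transform("sum")
print(df)
#     Name   Type  ID  group_total
# 0  Book1  ebook   1            2
# 1  Book2  paper   2            4
# 2  Book3  paper   3            3
# 3  Book1  ebook   1            2
# 4  Book2  paper   2            4
```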
