
Why do groupby operations behave differently?

When using pandas groupby functions and manipulating the output afterwards, I've noticed that some functions behave differently in terms of what is returned as the index and how it can be manipulated.

Say we have a DataFrame with the following information:

    Name   Type  ID
0  Book1  ebook   1
1  Book2  paper   2
2  Book3  paper   3
3  Book1  ebook   1
4  Book2  paper   2
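
For anyone who wants to reproduce the outputs below, here is a minimal sketch of how the example frame could be built. The construction itself is an assumption; the question only shows the printed frame.

```python
import pandas as pd

# Rebuild the example frame shown above (column values copied from the printout).
df = pd.DataFrame(
    {
        "Name": ["Book1", "Book2", "Book3", "Book1", "Book2"],
        "Type": ["ebook", "paper", "paper", "ebook", "paper"],
        "ID": [1, 2, 3, 1, 2],
    }
)
print(df)
```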

If we do

df.groupby(["Name", "Type"]).sum()  

we get a DataFrame:

             ID
Name  Type     
Book1 ebook   2
Book2 paper   4
Book3 paper   3

which has a MultiIndex built from the columns used in the groupby:

MultiIndex([('Book1', 'ebook'),
            ('Book2', 'paper'),
            ('Book3', 'paper')],
           names=['Name', 'Type'])

and one column called ID.

But if I apply size(), the result is a Series:

Name   Type 
Book1  ebook    2
Book2  paper    2
Book3  paper    1
dtype: int64

And lastly, if I do a pct_change(), we get only the resulting DataFrame column, indexed like the original rows:

    ID
0   NaN
1   NaN
2   NaN
3   0.0
4   0.0

TL;DR: I want to know why some functions return a Series while others return a DataFrame, as this confused me when dealing with different operations within the same DataFrame.

From the documentation for size:

    Returns: Series. Number of rows in each group.

As for sum: you called it on the whole grouped DataFrame rather than on a single selected column, so it returns a DataFrame in which the groupby keys are moved out of the columns and into the index. Selecting the column first returns a Series instead:

df.groupby(["Name", "Type"])['ID'].sum()  # return Series

Functions like diff and pct_change are not aggregations: they return values with the same index as the original DataFrame. count, mean and sum are aggregations: they return one value per group, with the groupby keys as the index.
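
A small sketch of that distinction, using the question's data (rebuilt here as an assumption) and comparing the index of an aggregation with the index of a row-wise function. diff is used as the row-wise example.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "Name": ["Book1", "Book2", "Book3", "Book1", "Book2"],
        "Type": ["ebook", "paper", "paper", "ebook", "paper"],
        "ID": [1, 2, 3, 1, 2],
    }
)
gp = df.groupby(["Name", "Type"])

agg_result = gp.sum()   # aggregation: one row per group
row_result = gp.diff()  # not an aggregation: one row per original row

# The aggregation is indexed by the group keys.
print(list(agg_result.index.names), len(agg_result))

# The row-wise result keeps the original index.
print(row_result.index.equals(df.index), len(row_result))
```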

The outputs differ because the operations differ, and the operation is what mostly controls what is returned. Think of the array equivalent: the data are the same, but one operation returns a single scalar value while the other returns an array the same size as the input.

import numpy as np
np.array([1,2,3]).sum()
#6

np.array([1,2,3]).cumsum()
#array([1, 3, 6], dtype=int32)

The same goes for operations on a DataFrameGroupBy object. The first part of the groupby only creates a mapping from the DataFrame to the groups. Since nothing has been computed at that point, there is no reason why the same groupby followed by different operations should return the same type of output (see above).

gp = df.groupby(["Name", "Type"])
# Haven't done any aggregations yet...

The other important point here is that we have a DataFrameGroupBy object. There are also SeriesGroupBy objects, and that difference can change the return type.

gp
#<pandas.core.groupby.generic.DataFrameGroupBy object>

So what happens when you aggregate?

With a DataFrameGroupBy, when you choose an aggregation (like sum) that collapses each group to a single value, the return is a DataFrame whose index holds the unique grouping keys. The return is a DataFrame because we provided a DataFrameGroupBy object. DataFrames can have multiple columns, and had there been another numeric column it would have been aggregated too, necessitating the DataFrame output.

gp.sum()
#             ID
#Name  Type     
#Book1 ebook   2
#Book2 paper   4
#Book3 paper   3

On the other hand, if you use a SeriesGroupBy object (select a single column with []), then you'll get a Series back, again indexed by the unique group keys.

df.groupby(["Name", "Type"])['ID'].sum()
# everything up to and including ['ID'] is a SeriesGroupBy

#Name   Type 
#Book1  ebook    2
#Book2  paper    4
#Book3  paper    3
#Name: ID, dtype: int64
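
As a quick check of the two cases above, the object types can be inspected directly. This is only an illustrative sketch on the question's data (rebuilt here as an assumption); the double-bracket selection is an extra case not discussed in the answer, and it keeps a DataFrameGroupBy.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "Name": ["Book1", "Book2", "Book3", "Book1", "Book2"],
        "Type": ["ebook", "paper", "paper", "ebook", "paper"],
        "ID": [1, 2, 3, 1, 2],
    }
)

frame_gp = df.groupby(["Name", "Type"])            # DataFrameGroupBy
series_gp = df.groupby(["Name", "Type"])["ID"]     # SeriesGroupBy
frame_gp_one = df.groupby(["Name", "Type"])[["ID"]]  # list selection: still DataFrameGroupBy

for name, obj in [("frame_gp", frame_gp), ("series_gp", series_gp), ("frame_gp_one", frame_gp_one)]:
    # Print the groupby type and the type its sum() returns.
    print(name, type(obj).__name__, type(obj.sum()).__name__)
```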

For operations that return one value per input row (like cumsum and pct_change), a DataFrameGroupBy returns a DataFrame and a SeriesGroupBy returns a Series. But the index is no longer the unique group keys, because that would make little sense: typically you want to do a calculation within each group and then assign the result back to the original DataFrame. As a result, the return is indexed like the original DataFrame you grouped. This makes creating these columns very simple, as pandas handles all of the alignment:

df['ID_pct_change'] = gp.pct_change()

#    Name   Type  ID  ID_pct_change
#0  Book1  ebook   1            NaN  
#1  Book2  paper   2            NaN   
#2  Book3  paper   3            NaN   
#3  Book1  ebook   1            0.0  # Calculated from row 0 and aligned.
#4  Book2  paper   2            0.0
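
The alignment does not depend on a default 0..n-1 index. A minimal sketch, assuming a made-up integer index (10, 20, ...) and using cumsum on a SeriesGroupBy so the assigned value is a single Series:

```python
import pandas as pd

df = pd.DataFrame(
    {
        "Name": ["Book1", "Book2", "Book3", "Book1", "Book2"],
        "Type": ["ebook", "paper", "paper", "ebook", "paper"],
        "ID": [1, 2, 3, 1, 2],
    },
    index=[10, 20, 30, 40, 50],  # hypothetical non-default index
)

# The result carries the same index labels as df, so assignment aligns row by row.
df["ID_cumsum"] = df.groupby(["Name", "Type"])["ID"].cumsum()
print(df)
```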

But what about size? That one is a bit unusual. The size of a group is a scalar. It doesn't matter how many columns the group has or whether values in those columns are missing, so it is irrelevant whether you call it on a DataFrameGroupBy or a SeriesGroupBy object. As a result, with the default as_index=True, pandas returns a Series in both cases. And since it is a group-level aggregation that returns a scalar, it makes sense for the result to be indexed by the unique group keys.

gp.size()
#Name   Type 
#Book1  ebook    2
#Book2  paper    2
#Book3  paper    1
#dtype: int64
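
To illustrate why size ignores the columns, the sketch below compares it with count on a variant of the example data in which one ID is missing. The missing value is an assumption introduced for the illustration. It also shows the as_index=False case, which the answer does not cover and where recent pandas versions (1.1 and later, as far as I know) return a DataFrame rather than a Series.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "Name": ["Book1", "Book2", "Book3", "Book1", "Book2"],
        "Type": ["ebook", "paper", "paper", "ebook", "paper"],
        "ID": [1, 2, 3, np.nan, 2],  # one missing value, added for illustration
    }
)
gp = df.groupby(["Name", "Type"])

sizes = gp.size()    # Series: rows per group, missing values included
counts = gp.count()  # DataFrame: non-null values per group, per column

# With as_index=False the keys stay as columns and the result is a DataFrame.
sizes_frame = df.groupby(["Name", "Type"], as_index=False).size()

print(sizes)
print(counts)
print(sizes_frame)
```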

Finally, for completeness: although aggregations like sum return a single scalar value per group, it is often useful to bring those values back to every row for that group in the original DataFrame. However, the return of a normal .sum has a different index, so it won't align. You could merge the values back on the unique keys, but pandas provides the ability to transform these aggregations. Since the intent is to bring the result back to the original DataFrame, the Series/DataFrame is indexed like the original input:

gp.transform('sum')
#   ID
#0   2    # Row 0 is Book1 ebook which has a group sum of 2
#1   4
#2   3
#3   2    # Row 3 is also Book1 ebook which has a group sum of 2
#4   4
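
For comparison, here is a sketch of both routes mentioned above: assigning the transform result directly, and merging the aggregated sums back on the keys. The new column names (ID_group_sum, ID_sum_merged) are hypothetical and chosen only for this example.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "Name": ["Book1", "Book2", "Book3", "Book1", "Book2"],
        "Type": ["ebook", "paper", "paper", "ebook", "paper"],
        "ID": [1, 2, 3, 1, 2],
    }
)
keys = ["Name", "Type"]

# Route 1: transform returns a Series indexed like df, so it assigns directly.
df["ID_group_sum"] = df.groupby(keys)["ID"].transform("sum")

# Route 2: aggregate, then merge the per-group sums back on the keys.
group_sums = (
    df.groupby(keys, as_index=False)["ID"]
    .sum()
    .rename(columns={"ID": "ID_sum_merged"})
)
merged = df.merge(group_sums, on=keys, how="left")

print(merged)
```

Both columns end up identical; transform simply skips the intermediate frame and the merge.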
