Why do groupby operations behave differently?
When using pandas groupby functions and manipulating the output after the groupby, I've noticed that some functions behave differently in terms of what is returned as the index and how this can be manipulated.
Say we have a DataFrame with the following information:
Name Type ID
0 Book1 ebook 1
1 Book2 paper 2
2 Book3 paper 3
3 Book1 ebook 1
4 Book2 paper 2
If we do
df.groupby(["Name", "Type"]).sum()
we get a DataFrame:
ID
Name Type
Book1 ebook 2
Book2 paper 4
Book3 paper 3
which contains a MultiIndex built from the columns used in the groupby:
MultiIndex([('Book1', 'ebook'),
('Book2', 'paper'),
('Book3', 'paper')],
names=['Name', 'Type'])
and one column called ID.
But if I apply the size() function, the result is a Series:
Name Type
Book1 ebook 2
Book2 paper 2
Book3 paper 1
dtype: int64
And lastly, if I do a pct_change(), we get only the resulting DataFrame column:
ID
0 NaN
1 NaN
2 NaN
3 0.0
4 0.0
TL;DR: I want to know why some functions return a Series whilst others return a DataFrame, as this confused me when dealing with different operations on the same DataFrame.
From the documentation for size:
Returns: Series. Number of rows in each group.
For sum, since you did not select a column to sum, it returns a DataFrame (the groupby keys become the index rather than columns). Selecting the column first returns a Series:
df.groupby(["Name", "Type"])['ID'].sum() # return Series
Functions like diff and pct_change are not aggregations; they return values with the same index as the original DataFrame. Functions like count, mean and sum are aggregations; they return one value per group, with the groupby keys as the index.
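To make that distinction concrete, here is a minimal sketch that rebuilds the DataFrame from the question and compares the index of an aggregation (sum) with that of a non-aggregating function (diff). The variable names agg and non_agg are only illustrative.

```python
import pandas as pd

# Rebuild the example DataFrame from the question.
df = pd.DataFrame({
    "Name": ["Book1", "Book2", "Book3", "Book1", "Book2"],
    "Type": ["ebook", "paper", "paper", "ebook", "paper"],
    "ID": [1, 2, 3, 1, 2],
})

gp = df.groupby(["Name", "Type"])["ID"]

agg = gp.sum()       # aggregation: one row per group
non_agg = gp.diff()  # not an aggregation: one row per original row

print(list(agg.index.names))            # the groupby keys
print(non_agg.index.equals(df.index))   # same index as the original
print(len(agg), len(non_agg))
```

The aggregated result has three rows (one per group), while the diff result keeps all five rows of the input.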
The outputs are different because the aggregations are different, and those are what mostly control what is returned. Think of the array equivalent. The data are the same, but one "aggregation" returns a single scalar value while the other returns an array the same size as the input:
import numpy as np
np.array([1,2,3]).sum()
#6
np.array([1,2,3]).cumsum()
#array([1, 3, 6], dtype=int32)
The same thing goes for aggregations of a DataFrameGroupBy object. All the first part of the groupby does is create a mapping from the DataFrame to the groups. Since this doesn't compute anything yet, there's no reason why the same groupby with a different operation needs to return the same type of output (see above).
gp = df.groupby(["Name", "Type"])
# Haven't done any aggregations yet...
The other important part here is that we have a DataFrameGroupBy object. There are also SeriesGroupBy objects, and that difference can change the return.
gp
#<pandas.core.groupby.generic.DataFrameGroupBy object>
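As a rough illustration of the "mapping from the DataFrame to the groups", the sketch below inspects the groupby object before any aggregation is applied. It uses the ngroups and groups attributes of the groupby object; the DataFrame is rebuilt from the question.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Book1", "Book2", "Book3", "Book1", "Book2"],
    "Type": ["ebook", "paper", "paper", "ebook", "paper"],
    "ID": [1, 2, 3, 1, 2],
})

gp = df.groupby(["Name", "Type"])

# Number of distinct (Name, Type) combinations.
print(gp.ngroups)

# Each group key maps to the row labels that belong to that group.
for key, labels in gp.groups.items():
    print(key, list(labels))
```

Rows 0 and 3 both map to ('Book1', 'ebook'), which is why every per-group operation later treats them together.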
So what happens when you aggregate?
With a DataFrameGroupBy, when you choose an aggregation (like sum) that collapses to a single value per group, the return will be a DataFrame whose index is the unique grouping keys. The return is a DataFrame because we provided a DataFrameGroupBy object. DataFrames can have multiple columns, and had there been another numeric column it would have been aggregated too, necessitating the DataFrame output.
gp.sum()
# ID
#Name Type
#Book1 ebook 2
#Book2 paper 4
#Book3 paper 3
On the other hand, if you use a SeriesGroupBy object (select a single column with []) then you'll get a Series back, again indexed by the unique group keys.
df.groupby(["Name", "Type"])['ID'].sum()
#|-------- SeriesGroupBy --------|
#Name Type
#Book1 ebook 2
#Book2 paper 4
#Book3 paper 3
#Name: ID, dtype: int64
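A small sketch of how the selection decides the groupby type, and in turn the return type of sum. The names frame_gp, series_gp and still_frame_gp are made up for this example; the point is that a single label in [] gives a SeriesGroupBy, while a list of labels keeps a DataFrameGroupBy even for one column.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Book1", "Book2", "Book3", "Book1", "Book2"],
    "Type": ["ebook", "paper", "paper", "ebook", "paper"],
    "ID": [1, 2, 3, 1, 2],
})
keys = ["Name", "Type"]

frame_gp = df.groupby(keys)                 # no column selected
series_gp = df.groupby(keys)["ID"]          # single label
still_frame_gp = df.groupby(keys)[["ID"]]   # list of labels

for label, g in [("no selection", frame_gp),
                 ("['ID']", series_gp),
                 ("[['ID']]", still_frame_gp)]:
    # Print the groupby type and the type that sum() returns for it.
    print(label, type(g).__name__, "->", type(g.sum()).__name__)
```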
For operations that return arrays (like cumsum, pct_change) a DataFrameGroupBy will return a DataFrame and a SeriesGroupBy will return a Series. But the index is no longer the unique group keys. That would make little sense here: typically you'd want to do a calculation within the group and then assign the result back to the original DataFrame. As a result, the return is indexed like the original DataFrame you provided. This makes creating these columns very simple, as pandas handles all of the alignment:
df['ID_pct_change'] = gp.pct_change()
# Name Type ID ID_pct_change
#0 Book1 ebook 1 NaN
#1 Book2 paper 2 NaN
#2 Book3 paper 3 NaN
#3 Book1 ebook 1 0.0 # Calculated from row 0 and aligned.
#4 Book2 paper 2 0.0
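The alignment is by index label rather than by position. As a sketch, assume the same data but with a hypothetical non-default index of letters, and use cumsum in place of pct_change: the result carries the original labels, so it can be assigned straight back.

```python
import pandas as pd

# Same data as the question, but with an assumed letter index
# to show that the original labels are preserved.
df = pd.DataFrame(
    {
        "Name": ["Book1", "Book2", "Book3", "Book1", "Book2"],
        "Type": ["ebook", "paper", "paper", "ebook", "paper"],
        "ID": [1, 2, 3, 1, 2],
    },
    index=list("abcde"),
)

running = df.groupby(["Name", "Type"])["ID"].cumsum()

# The result is indexed like df, not by the group keys.
print(running.index.equals(df.index))

# Assignment aligns on those labels.
df["ID_cumsum"] = running
print(df)
```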
But what about size? That one is a bit weird. The size of a group is a scalar. It doesn't matter how many columns the group has or whether values in those columns are missing, so sending it a DataFrameGroupBy or SeriesGroupBy object is irrelevant. As a result pandas will always return a Series. Again, being a group-level aggregation that returns a scalar, it makes sense to have the return indexed by the unique group keys.
gp.size()
#Name Type
#Book1 ebook 2
#Book2 paper 2
#Book3 paper 1
#dtype: int64
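To illustrate why missing values don't matter for size, here is a sketch that compares it with count. It assumes a variant of the example data in which one ID is replaced by NaN; that modification is mine, not part of the question.

```python
import numpy as np
import pandas as pd

# Variant of the example data with one missing ID (an assumption
# made only for this illustration).
df = pd.DataFrame({
    "Name": ["Book1", "Book2", "Book3", "Book1", "Book2"],
    "Type": ["ebook", "paper", "paper", "ebook", "paper"],
    "ID": [1.0, np.nan, 3.0, 1.0, 2.0],
})
gp = df.groupby(["Name", "Type"])

sizes = gp.size()          # rows per group, NaN included
counts = gp["ID"].count()  # non-missing ID values per group

print(sizes)
print(counts)

# Selecting a column first does not change the sizes.
print(gp["ID"].size().tolist() == sizes.tolist())
```

This describes the default as_index=True. The pandas documentation states that size returns a DataFrame instead when the groupby is created with as_index=False.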
Finally, for completeness: though aggregations like sum return a single scalar value per group, it can often be useful to bring those values back to every row of that group in the original DataFrame. However, the return of a normal .sum has a different index, so it won't align. You could merge the values back on the unique keys, but pandas provides the ability to transform these aggregations. Since the intent here is to bring the result back to the original DataFrame, the Series/DataFrame is indexed like the original input:
gp.transform('sum')
# ID
#0 2 # Row 0 is Book1 ebook which has a group sum of 2
#1 4
#2 3
#3 2 # Row 3 is also Book1 ebook which has a group sum of 2
#4 4
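For comparison, a sketch of both routes mentioned above: aggregating and merging back on the keys, versus transform. The column name ID_sum is chosen here for illustration only.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Book1", "Book2", "Book3", "Book1", "Book2"],
    "Type": ["ebook", "paper", "paper", "ebook", "paper"],
    "ID": [1, 2, 3, 1, 2],
})
keys = ["Name", "Type"]

# Route 1: aggregate, turn the keys back into columns, then merge.
group_sums = df.groupby(keys)["ID"].sum().rename("ID_sum").reset_index()
merged = df.merge(group_sums, on=keys, how="left")

# Route 2: transform returns a Series indexed like df,
# so it can be assigned directly.
df["ID_sum"] = df.groupby(keys)["ID"].transform("sum")

print(merged)
print(df)
```

Both routes give the same ID_sum column on this data; transform just skips the intermediate table and the merge.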