Difference between groupby and pivot_table
I just started learning Pandas and was wondering whether there is any difference between the groupby and pivot_table functions. Can anyone help me understand the difference between them?
Both pivot_table and groupby are used to aggregate your dataframe. They differ only in the shape of the result.
Using pd.pivot_table(df, index=["a"], columns=["b"], values=["c"], aggfunc=np.sum), a table is created where a is on the row axis, b is on the column axis, and the values are the sum of c.
Example:
import numpy as np
import pandas as pd

# c is random, so the exact numbers differ from run to run
df = pd.DataFrame({"a": [1,2,3,1,2,3], "b":[1,1,1,2,2,2], "c":np.random.rand(6)})
pd.pivot_table(df, index=["a"], columns=["b"], values=["c"], aggfunc=np.sum)

          c
b         1         2
a
1  0.528470  0.484766
2  0.187277  0.144326
3  0.866832  0.650100
Using groupby, the dimensions given become index levels (displayed as the leading columns of the output), and one row is created for each combination of those dimensions.
In this example, we create a series of the sum of values c, grouped by all unique combinations of a and b.
df.groupby(['a','b'])['c'].sum()

a  b
1  1    0.528470
   2    0.484766
2  1    0.187277
   2    0.144326
3  1    0.866832
   2    0.650100
Name: c, dtype: float64
A similar usage of groupby is to omit the ['c']. In this case, it creates a dataframe (not a series) of the sums of all remaining columns, grouped by the unique values of a and b.
print(df.groupby(["a","b"]).sum())

            c
a b
1 1  0.528470
  2  0.484766
2 1  0.187277
  2  0.144326
3 1  0.866832
  2  0.650100
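Because the dataframe above has only one remaining column, the difference between the two groupby forms is easy to miss. Here is a minimal sketch with an extra column d, which is an assumption added purely for illustration (it is not in the original example), and with fixed integer values instead of random ones:

```python
import pandas as pd

df = pd.DataFrame({
    "a": [1, 2, 3, 1, 2, 3],
    "b": [1, 1, 1, 2, 2, 2],
    "c": [10, 20, 30, 40, 50, 60],
    "d": [1.5, 2.5, 3.5, 4.5, 5.5, 6.5],  # hypothetical extra column
})

# Selecting one column before aggregating returns a Series
as_series = df.groupby(["a", "b"])["c"].sum()

# Omitting the selection aggregates every remaining column into a DataFrame
as_frame = df.groupby(["a", "b"]).sum()

print(type(as_series).__name__)
print(type(as_frame).__name__, list(as_frame.columns))
```

Both results share the same (a, b) row index; only the number of value columns changes.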
It's more appropriate to use .pivot_table() instead of .groupby() when you need to show aggregates with both row and column labels.
.pivot_table() makes it easy to create row and column labels at the same time and is preferable, even though you can get similar results using .groupby() with a few extra steps.
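As a rough sketch of what those "few extra steps" can look like, the snippet below builds the same wide table both ways. The fixed integer data is an assumption chosen so the result is reproducible; here the extra step on the groupby side is a single unstack call.

```python
import pandas as pd

df = pd.DataFrame({
    "a": [1, 2, 3, 1, 2, 3],
    "b": [1, 1, 1, 2, 2, 2],
    "c": [10, 20, 30, 40, 50, 60],
})

# One call: a on the rows, b on the columns, sum of c in the cells
wide_pt = df.pivot_table(index="a", columns="b", values="c", aggfunc="sum")

# groupby gives the long form; unstack moves b from the row index to the columns
wide_gb = df.groupby(["a", "b"])["c"].sum().unstack("b")

print(wide_pt)
print(wide_gb)
```

The two printed tables should show the same labels and sums; pivot_table simply states the intended layout up front.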
Both pivot_table = groupby + unstack and groupby = pivot_table + stack hold.
In particular, if the columns parameter of pivot_table() is not used, then groupby() and pivot_table() both produce the same result (if the same aggregator function is used).
import pandas as pd

# sample data
df = pd.DataFrame({"a": [1,1,1,2,2,2], "b": [1,1,2,2,3,3], "c": [0,0.5,1,1,2,2]})
# the same aggregation, once with groupby and once with pivot_table (no columns argument)
gb = df.groupby(['a','b'])[['c']].sum()
pt = df.pivot_table(index=['a','b'], values=['c'], aggfunc='sum')
# equality test
gb.equals(pt)  # True
In general, if we check the source code, pivot_table() internally calls __internal_pivot_table(). This function creates a single flat list out of index and columns and calls groupby() with this list as the grouper. Then, after aggregation, it calls unstack() on the list of columns.
If columns are never passed, there is nothing to unstack on, so groupby and pivot_table trivially produce the same output.
A demonstration of this behavior:
gb = (
    df
    .groupby(['a','b'])[['c']].sum()
    .unstack(['b'])
)
pt = df.pivot_table(index=['a'], columns=['b'], values=['c'], aggfunc='sum')
gb.equals(pt)  # True
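The mechanism described above can be sketched as a small helper. The name simple_pivot and its signature are hypothetical; this is a simplified illustration of the "flatten the keys, group, aggregate, unstack" idea, not pandas' actual implementation, which handles many more cases (fill values, margins, dropping empty rows and columns, and so on).

```python
import pandas as pd

def simple_pivot(df, index, columns, values, aggfunc="sum"):
    """Hypothetical, simplified stand-in for pivot_table (not pandas' real code)."""
    keys = list(index) + list(columns)                   # one flat list of grouping keys
    aggregated = df.groupby(keys)[values].agg(aggfunc)   # long-format aggregate
    if columns:                                          # nothing to unstack without columns
        aggregated = aggregated.unstack(list(columns))
    return aggregated

df = pd.DataFrame({"a": [1,1,1,2,2,2], "b": [1,1,2,2,3,3], "c": [0,0.5,1,1,2,2]})

mine = simple_pivot(df, index=["a"], columns=["b"], values=["c"])
theirs = df.pivot_table(index=["a"], columns=["b"], values=["c"], aggfunc="sum")
print(mine)
print(theirs)
```

On this sample data the two printed tables should match, including the NaN cells for (a, b) pairs that never occur.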
As stack() is the inverse operation of unstack(), the following holds as well:
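# Note: this relies on stack() dropping the NaN cells that the pivot introduced
# for missing (a, b) pairs (the long-standing default; newer pandas versions may differ)
(
    df
    .pivot_table(index=['a'], columns=['b'], values=['c'], aggfunc='sum')
    .stack(['b'])
    .equals(
        df.groupby(['a','b'])[['c']].sum()
    )
)  # True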
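One detail about the shapes involved is worth spelling out. In the sample data, not every (a, b) pair occurs, so the wide pivot_table result contains NaN cells for the missing pairs, while the long groupby result simply has no row for them. A minimal sketch using the same sample dataframe:

```python
import pandas as pd

df = pd.DataFrame({"a": [1,1,1,2,2,2], "b": [1,1,2,2,3,3], "c": [0,0.5,1,1,2,2]})

long_form = df.groupby(["a", "b"])[["c"]].sum()
wide_form = df.pivot_table(index=["a"], columns=["b"], values=["c"], aggfunc="sum")

# The long form has one row per pair that actually occurs in the data
print(len(long_form))

# The wide form has a cell for every (a, b) combination; absent pairs are NaN
print(wide_form.size, int(wide_form.isna().sum().sum()))
```

The number of non-NaN cells in the wide table equals the number of rows in the long one, which is why the two can be converted into each other.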
In conclusion, depending on the use case, one is more convenient than the other, but each can be used in place of the other: after correctly applying stack() / unstack(), both produce the same output.