
Difference between groupby and pivot_table

I just started learning Pandas and was wondering whether there is any difference between the groupby and pivot_table functions. Can anyone help me understand the difference between them?

Both pivot_table and groupby are used to aggregate your dataframe. They differ only in the shape of the result.

Calling pd.pivot_table(df, index=["a"], columns=["b"], values="c", aggfunc="sum") creates a table with a on the row axis, b on the column axis, and the sum of c as the values.

Example:

import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1,2,3,1,2,3], "b":[1,1,1,2,2,2], "c":np.random.rand(6)})
pd.pivot_table(df, index=["a"], columns=["b"], values="c", aggfunc="sum")

b         1         2
a                    
1  0.528470  0.484766
2  0.187277  0.144326
3  0.866832  0.650100

With groupby, the grouping keys become levels of the row index, and one row is created for each combination of those keys that occurs in the data.

In this example, we create a Series holding the sum of c, grouped by every unique combination of a and b.

df.groupby(['a','b'])['c'].sum()

a  b
1  1    0.528470
   2    0.484766
2  1    0.187277
   2    0.144326
3  1    0.866832
   2    0.650100
Name: c, dtype: float64

A similar usage of groupby omits the ['c'] selection. In that case the result is a DataFrame (not a Series) containing the sums of all remaining columns, grouped by the unique values of a and b.

print(df.groupby(["a","b"]).sum())
            c
a b          
1 1  0.528470
  2  0.484766
2 1  0.187277
  2  0.144326
3 1  0.866832
  2  0.650100
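The shape difference described above can be checked on a small frame with fixed numbers. This is a minimal sketch: the values in c and the names wide and long are assumptions chosen for illustration, replacing the random data so the result is reproducible.

```python
import pandas as pd

# Fixed values (an assumption for illustration) instead of np.random.rand,
# so the numbers are the same on every run.
df = pd.DataFrame({
    "a": [1, 2, 3, 1, 2, 3],
    "b": [1, 1, 1, 2, 2, 2],
    "c": [10, 20, 30, 40, 50, 60],
})

# Wide layout: a on the rows, b on the columns.
wide = pd.pivot_table(df, index="a", columns="b", values="c", aggfunc="sum")

# Long layout: one row per (a, b) pair, both keys in the row index.
long = df.groupby(["a", "b"])["c"].sum()

print(wide.shape)  # (3, 2)
print(long.shape)  # (6,)

# The same aggregate is addressed by (row, column) in one layout
# and by a two-level row key in the other.
print(wide.loc[2, 1], long.loc[(2, 1)])  # 20 20
```

Both objects hold the same six sums; only the arrangement differs.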

It is more appropriate to use .pivot_table() instead of .groupby() when you need to show aggregates with both row and column labels.

.pivot_table() creates row and column labels in a single call, which makes it preferable for that layout, even though .groupby() can produce a similar result with a few extra steps.

In short, pivot_table = groupby + unstack, and groupby = pivot_table + stack.

In particular, if the columns parameter of pivot_table() is not used, groupby() and pivot_table() produce the same result (provided the same aggregation function is used).

import pandas as pd

# sample data
df = pd.DataFrame({"a": [1,1,1,2,2,2], "b": [1,1,2,2,3,3], "c": [0,0.5,1,1,2,2]})

# the same aggregation written both ways (no columns argument)
gb = df.groupby(['a','b'])[['c']].sum()
pt = df.pivot_table(index=['a','b'], values=['c'], aggfunc='sum')

# equality test
gb.equals(pt)  # True

In general, the source code shows that pivot_table() internally calls __internal_pivot_table(). This function builds a single flat list from index and columns and calls groupby() with that list as the grouper. After aggregating, it calls unstack() on the list of columns.

If columns is never passed, there is nothing to unstack, so groupby and pivot_table trivially produce the same output.

A demonstration of this mechanism:

gb = (
    df
    .groupby(['a','b'])[['c']].sum()
    .unstack(['b'])
)
pt = df.pivot_table(index=['a'], columns=['b'], values=['c'], aggfunc='sum')

gb.equals(pt) # True
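The three steps described above (flatten the keys, group and aggregate, unstack the column keys) can be sketched as a small helper. The function pivot_via_groupby is hypothetical and written only for illustration; it is not part of pandas and leaves out everything else the real pivot_table() handles, such as margins, fill_value, and dropna.

```python
import pandas as pd

def pivot_via_groupby(df, index, columns, values, aggfunc="sum"):
    """Hypothetical helper (not a pandas API): groupby on all keys, then unstack."""
    keys = list(index) + list(columns)           # one flat list of groupers
    result = df.groupby(keys)[values].agg(aggfunc)
    if columns:                                  # nothing to unstack without column keys
        result = result.unstack(list(columns))
    return result

df = pd.DataFrame({"a": [1, 1, 1, 2, 2, 2],
                   "b": [1, 1, 2, 2, 3, 3],
                   "c": [0, 0.5, 1, 1, 2, 2]})

sketch = pivot_via_groupby(df, index=["a"], columns=["b"], values=["c"])
pt = df.pivot_table(index=["a"], columns=["b"], values=["c"], aggfunc="sum")

# Compare the simplified helper against the real pivot_table on the sample frame.
print(sketch.equals(pt))
```

With an empty columns list the helper reduces to a plain groupby, which mirrors the special case discussed above.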

Since stack() is the inverse operation of unstack(), the following holds as well:

(
    df
    .pivot_table(index=['a'], columns=['b'], values=['c'], aggfunc='sum')
    .stack(['b'])
    .equals(
        df.groupby(['a','b'])[['c']].sum()
    )
) # True
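One detail of the sample frame is worth seeing explicitly: not every (a, b) combination occurs, so the wide table contains empty cells that have no counterpart in the groupby result. The sketch below counts those cells and then repeats the round trip. The explicit dropna() and sort_index() calls are an assumption added for safety, since how stack() treats missing values and ordering may depend on the pandas version; on a version where stack() already drops them, both calls change nothing.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 1, 1, 2, 2, 2],
                   "b": [1, 1, 2, 2, 3, 3],
                   "c": [0, 0.5, 1, 1, 2, 2]})

pt = df.pivot_table(index=["a"], columns=["b"], values=["c"], aggfunc="sum")
gb = df.groupby(["a", "b"])[["c"]].sum()

# (a=1, b=3) and (a=2, b=1) never occur, so the wide table has two NaN cells.
print(pt.isna().sum().sum())  # 2

# The long result simply has no rows for those combinations.
print(len(gb))  # 4

# Round trip back to the long layout, normalising missing cells and order
# explicitly (assumption: this makes the comparison version-independent).
roundtrip = pt.stack(["b"]).dropna().sort_index()
print(roundtrip.equals(gb))
```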

In conclusion, one is more convenient than the other depending on the use case, but each can stand in for the other: after correctly applying stack() / unstack(), both produce the same output.
