
Is There Complete Overlap Between `pd.pivot_table` and `pd.DataFrame.groupby` + `pd.DataFrame.unstack`?

(Please note that there is an existing question, "Pandas: group by and Pivot table difference", but this question is different.)

Suppose you start with a DataFrame:

>>> import pandas as pd
>>> df = pd.DataFrame({'a': ['x'] * 2 + ['y'] * 2, 'b': [0, 1, 0, 1], 'val': range(4)})
>>> df
   a  b  val
0  x  0    0
1  x  1    1
2  y  0    2
3  y  1    3

Now suppose you want to make a the index, b the columns, and val the cell values, and to specify what to do if two or more values land in the same resulting cell:

b  0  1
a      
x  0  1
y  2  3

Then you can do this either through

df.val.groupby([df.a, df.b]).sum().unstack()

or through

pd.pivot_table(df, index='a', columns='b', values='val', aggfunc='sum')
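To check the claimed correspondence on this particular frame, here is a small sketch that builds the table both ways and compares values and labels. The names via_groupby and via_pivot are introduced here for illustration only.

```python
import pandas as pd

df = pd.DataFrame({'a': ['x'] * 2 + ['y'] * 2, 'b': [0, 1, 0, 1], 'val': range(4)})

# Route 1: group on both keys, aggregate, then move 'b' into the columns.
via_groupby = df.val.groupby([df.a, df.b]).sum().unstack()

# Route 2: let pivot_table do the grouping and reshaping in one call.
via_pivot = pd.pivot_table(df, index='a', columns='b', values='val', aggfunc='sum')

# Compare cell values and axis labels explicitly.
same_values = via_groupby.values.tolist() == via_pivot.values.tolist()
same_index = list(via_groupby.index) == list(via_pivot.index)
same_columns = list(via_groupby.columns) == list(via_pivot.columns)
print(same_values, same_index, same_columns)
```

This only shows agreement for one small example; it says nothing about the extra options pivot_table offers.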

So it seems to me that there's a simple correspondence between the two (given one, you could almost write a script to transform it into the other). I also thought of more complex cases with hierarchical indices / columns, but I still see no difference.

Is there something I've missed?

  • Are there operations that can be performed using one and not the other?

  • Are there, perhaps, operations that are easier to perform using one than the other?

  • If not, why not deprecate pivot_table? groupby seems much more general.

If I understood the source code for pivot_table(index, columns, values, aggfunc) correctly, it is a tuned-up equivalent of:

df.groupby(index + columns).agg(aggfunc).unstack(columns)
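As a rough, runnable version of that one-liner: the helper below, pivot_via_groupby, is a hypothetical name and is not part of pandas. It assumes index and columns are lists of column names and values is a single column name, and it ignores everything else pivot_table does (margins, dropna, fill_value and so on).

```python
import pandas as pd

def pivot_via_groupby(df, index, columns, values, aggfunc):
    """Hypothetical helper: approximate pivot_table with groupby + unstack."""
    keys = list(index) + list(columns)           # group on index keys and column keys
    grouped = df.groupby(keys)[values].agg(aggfunc)
    return grouped.unstack(list(columns))        # move the column keys to the columns axis

df = pd.DataFrame({
    'A': ['foo'] * 5 + ['bar'] * 4,
    'B': ['one', 'one', 'one', 'two', 'two', 'one', 'one', 'two', 'two'],
    'C': ['small', 'large', 'large', 'small', 'small', 'large', 'small', 'small', 'large'],
    'D': [1, 2, 2, 3, 3, 4, 5, 6, 7],
})

approx = pivot_via_groupby(df, index=['A', 'B'], columns=['C'], values='D', aggfunc='sum')
exact = df.pivot_table(index=['A', 'B'], columns='C', values='D', aggfunc='sum')
print(approx)
print(exact)
```

Selecting the values column before aggregating is a choice made in this sketch so that the result has a single column level.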

plus:

  • margins (subtotals and grand totals, as @ayhan has already said)
  • pivot_table() also removes the extra level from the columns axis (see the example below)
  • a convenient dropna parameter: do not include columns whose entries are all NaN
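To illustrate the first point, here is a sketch using the small frame from the question: margins=True adds an "All" row and column in a single call, while the groupby route needs the totals appended by hand. The variable names with_totals and manual are illustrative, and the manual version only works this simply because the aggregation is a sum.

```python
import pandas as pd

df = pd.DataFrame({'a': ['x'] * 2 + ['y'] * 2, 'b': [0, 1, 0, 1], 'val': range(4)})

# pivot_table: row and column totals come from a single flag.
with_totals = pd.pivot_table(df, index='a', columns='b', values='val',
                             aggfunc='sum', margins=True)

# groupby + unstack: the totals have to be appended manually.
manual = df.groupby(['a', 'b'])['val'].sum().unstack()
manual['All'] = manual.sum(axis=1)      # row totals as a new column
manual.loc['All'] = manual.sum(axis=0)  # column totals as a new row

print(with_totals)
print(manual)
```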

Demo (I took this DF from the docstring in the source code for pivot_table()):

In [40]: df
Out[40]:
     A    B      C  D
0  foo  one  small  1
1  foo  one  large  2
2  foo  one  large  2
3  foo  two  small  3
4  foo  two  small  3
5  bar  one  large  4
6  bar  one  small  5
7  bar  two  small  6
8  bar  two  large  7

In [41]: df.pivot_table(index=['A','B'], columns='C', values='D', aggfunc=[np.sum,np.mean])
Out[41]:
          sum        mean
C       large small large small
A   B
bar one   4.0   5.0   4.0   5.0
    two   7.0   6.0   7.0   6.0
foo one   4.0   1.0   2.0   1.0
    two   NaN   6.0   NaN   3.0

Note the extra top-level column, D, in the groupby version:

In [42]: df.groupby(['A','B','C']).agg([np.sum, np.mean]).unstack('C')
Out[42]:
            D
          sum        mean
C       large small large small
A   B
bar one   4.0   5.0   4.0   5.0
    two   7.0   6.0   7.0   6.0
foo one   4.0   1.0   2.0   1.0
    two   NaN   6.0   NaN   3.0
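The difference in column levels can also be checked programmatically. The sketch below rebuilds the demo frame, uses the string names 'sum' and 'mean' in place of np.sum and np.mean (an assumption made here to avoid warnings in recent pandas versions), and drops the extra level with droplevel.

```python
import pandas as pd

df = pd.DataFrame({
    'A': ['foo'] * 5 + ['bar'] * 4,
    'B': ['one', 'one', 'one', 'two', 'two', 'one', 'one', 'two', 'two'],
    'C': ['small', 'large', 'large', 'small', 'small', 'large', 'small', 'small', 'large'],
    'D': [1, 2, 2, 3, 3, 4, 5, 6, 7],
})

via_pivot = df.pivot_table(index=['A', 'B'], columns='C', values='D',
                           aggfunc=['sum', 'mean'])
via_groupby = df.groupby(['A', 'B', 'C']).agg(['sum', 'mean']).unstack('C')

# groupby keeps the value column name D as an extra top column level.
print(via_pivot.columns.nlevels, via_groupby.columns.nlevels)

# Dropping that level gives the same layout as the pivot_table result.
flattened = via_groupby.droplevel(0, axis=1)
print(flattened)
```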

"Why not deprecate pivot_table? groupby seems much more general."

IMO, because it's very easy to use and very convenient! ;)
