Is There Complete Overlap Between `pd.pivot_table` and `pd.DataFrame.groupby` + `pd.DataFrame.unstack`?
(Please note that there is a related question, "Pandas: group by and Pivot table difference", but this question is different.)
Suppose you start with a DataFrame
df = pd.DataFrame({'a': ['x'] * 2 + ['y'] * 2, 'b': [0, 1, 0, 1], 'val': range(4)})
>>> df
Out[18]:
a b val
0 x 0 0
1 x 1 1
2 y 0 2
3 y 1 3
Now suppose you want to make the index a, the columns b, the values in a cell val, and specify what to do if there are two or more values in a resulting cell:
b 0 1
a
x 0 1
y 2 3
Then you can do this either through
df.val.groupby([df.a, df.b]).sum().unstack()
or through
pd.pivot_table(df, index='a', columns='b', values='val', aggfunc='sum')
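One way to check that the two expressions above agree on this small frame is to build both results and compare them. This is only a sketch: the names via_groupby, via_pivot and same are introduced here for illustration, and the comparison covers this example only, not every possible input.

```python
import pandas as pd

# Same toy frame as in the question.
df = pd.DataFrame({"a": ["x"] * 2 + ["y"] * 2, "b": [0, 1, 0, 1], "val": range(4)})

# Route 1: group by both keys, aggregate, then move b into the columns.
via_groupby = df["val"].groupby([df["a"], df["b"]]).sum().unstack()

# Route 2: let pivot_table do the grouping and reshaping in one call.
via_pivot = pd.pivot_table(df, index="a", columns="b", values="val", aggfunc="sum")

# DataFrame.equals compares labels, dtypes and values.
same = via_groupby.equals(via_pivot)
print(via_groupby)
print(via_pivot)
print("identical:", same)
```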
So it seems to me that there's a simple correspondence between the two (given one, you could almost write a script to transform it into the other). I also thought of more complex cases with hierarchical indices / columns, but I still see no difference.
Is there something I've missed?
Are there operations that can be performed using one and not the other?
Are there, perhaps, operations that are easier to perform using one rather than the other?
If not, why not deprecate pivot_table? groupby seems much more general.
If I understood the source code for pivot_table(index, columns, values, aggfunc) correctly, it's a tuned-up equivalent of the following (with index and columns given as lists of column names):
df.groupby(index + columns).agg(aggfunc).unstack(columns)
plus:
pivot_table() also removes extra multi-levels from the columns axis (see the example below).
The dropna parameter: do not include columns whose entries are all NaN.
Demo (I took this DataFrame from the docstring in the source code for pivot_table()):
In [40]: df
Out[40]:
A B C D
0 foo one small 1
1 foo one large 2
2 foo one large 2
3 foo two small 3
4 foo two small 3
5 bar one large 4
6 bar one small 5
7 bar two small 6
8 bar two large 7
In [41]: df.pivot_table(index=['A','B'], columns='C', values='D', aggfunc=[np.sum,np.mean])
Out[41]:
sum mean
C large small large small
A B
bar one 4.0 5.0 4.0 5.0
two 7.0 6.0 7.0 6.0
foo one 4.0 1.0 2.0 1.0
two NaN 6.0 NaN 3.0
Pay attention to the top-level column, D:
In [42]: df.groupby(['A','B','C']).agg([np.sum, np.mean]).unstack('C')
Out[42]:
D
sum mean
C large small large small
A B
bar one 4.0 5.0 4.0 5.0
two 7.0 6.0 7.0 6.0
foo one 4.0 1.0 2.0 1.0
two NaN 6.0 NaN 3.0
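The extra top-level column can be removed by hand, which makes the correspondence between the two outputs explicit. The sketch below rebuilds the docstring frame and uses the string aggregation names "sum" and "mean" instead of np.sum and np.mean (an assumption made here to avoid deprecation warnings in recent pandas); the names pivoted, grouped and flattened are illustrative only.

```python
import pandas as pd

# The frame from the pivot_table docstring, as shown above.
df = pd.DataFrame({
    "A": ["foo"] * 5 + ["bar"] * 4,
    "B": ["one", "one", "one", "two", "two", "one", "one", "two", "two"],
    "C": ["small", "large", "large", "small", "small",
          "large", "small", "small", "large"],
    "D": [1, 2, 2, 3, 3, 4, 5, 6, 7],
})

pivoted = df.pivot_table(index=["A", "B"], columns="C", values="D",
                         aggfunc=["sum", "mean"])
grouped = df.groupby(["A", "B", "C"]).agg(["sum", "mean"]).unstack("C")

# grouped carries three column levels (D, aggregation, C);
# pivoted carries two (aggregation, C).
print(grouped.columns.nlevels, pivoted.columns.nlevels)

# Dropping the outermost level (D) leaves the same column labels as pivoted.
flattened = grouped.droplevel(0, axis=1)
print(flattened)
```

This only illustrates the level-dropping step on one input; pivot_table does other bookkeeping as well.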
why not deprecate pivot_table? groupby seems much more general.
IMO, because it's very easy to use and very convenient! ;)
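As a small illustration of that convenience, the dropna behaviour mentioned earlier can be seen on a toy frame where one column of the result would be entirely NaN. This is a sketch under assumptions: the frame and the names dropped, kept and via_groupby are made up for this example, and the behaviour shown is what current pandas releases are understood to do.

```python
import numpy as np
import pandas as pd

# Hypothetical data: every row with b == 1 has a missing value.
df = pd.DataFrame({
    "a": ["x", "x", "y", "y"],
    "b": [0, 1, 0, 1],
    "val": [1.0, np.nan, 2.0, np.nan],
})

# Default dropna=True: the all-NaN column for b == 1 is left out.
dropped = df.pivot_table(index="a", columns="b", values="val", aggfunc="mean")

# dropna=False keeps that column, filled with NaN.
kept = df.pivot_table(index="a", columns="b", values="val",
                      aggfunc="mean", dropna=False)

# groupby + unstack keeps it too; an explicit dropna(axis=1, how="all")
# would be needed to match the default pivot_table result.
via_groupby = df.groupby(["a", "b"])["val"].mean().unstack("b")

print(list(dropped.columns), list(kept.columns), list(via_groupby.columns))
```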