[英]Pandas: Difference between pivot and pivot_table. Why is only pivot_table working?
I have the following dataframe.我有以下数据框。
df.head(30)
struct_id resNum score_type_name score_value
0 4294967297 1 omega 0.064840
1 4294967297 1 fa_dun 2.185618
2 4294967297 1 fa_dun_dev 0.000027
3 4294967297 1 fa_dun_semi 2.185591
4 4294967297 1 ref -1.191180
5 4294967297 2 rama -0.795161
6 4294967297 2 omega 0.222345
7 4294967297 2 fa_dun 1.378923
8 4294967297 2 fa_dun_dev 0.028560
9 4294967297 2 fa_dun_rot 1.350362
10 4294967297 2 p_aa_pp -0.442467
11 4294967297 2 ref 0.249477
12 4294967297 3 rama 0.267443
13 4294967297 3 omega 0.005106
14 4294967297 3 fa_dun 0.020352
15 4294967297 3 fa_dun_dev 0.025507
16 4294967297 3 fa_dun_rot -0.005156
17 4294967297 3 p_aa_pp -0.096847
18 4294967297 3 ref 0.979644
19 4294967297 4 rama -1.403292
20 4294967297 4 omega 0.212160
21 4294967297 4 fa_dun 4.218029
22 4294967297 4 fa_dun_dev 0.003712
23 4294967297 4 fa_dun_semi 4.214317
24 4294967297 4 p_aa_pp -0.462765
25 4294967297 4 ref -1.960940
26 4294967297 5 rama -0.600053
27 4294967297 5 omega 0.061867
28 4294967297 5 fa_dun 3.663050
29 4294967297 5 fa_dun_dev 0.004953
According to the pivot documentation, I should be able to reshape this on the score_type_name using the pivot function.根据枢轴文档,我应该能够使用枢轴函数在 score_type_name 上对其进行重塑。
df.pivot(columns='score_type_name',values='score_value',index=['struct_id','resNum'])
But, I get the following.但是,我得到以下信息。
However, pivot_table function seems to work:然而,pivot_table 函数似乎工作:
pivoted = df.pivot_table(columns='score_type_name',
values='score_value',
index=['struct_id','resNum'])
But it does not lend itself, for me atleast, to further analysis.但至少对我来说,它不适合进一步分析。 I want it to just have the struct_id, resNum, and score_type_name as columns instead of stacking the score_type_name on top of the other columns.
我希望它只将 struct_id、resNum 和 score_type_name 作为列,而不是将 score_type_name 堆叠在其他列的顶部。 Additionally, I want the struct_id to be for every row, and not aggregate in a joined row like it does for the table.
此外,我希望 struct_id 适用于每一行,而不是像对表那样聚合在连接的行中。
So can anyone tell me how I can get a nice Dataframe like I want using pivot?那么谁能告诉我如何获得像我想要的那样使用数据透视表的漂亮数据框? Additionally, from the documentation, I can't tell why pivot_table works and pivot doesn't.
此外,从文档中,我不知道为什么 pivot_table 起作用而 pivot 不起作用。 If I look at the first example of pivot, it looks like exactly what I need.
如果我查看第一个枢轴示例,它看起来正是我所需要的。
PS I did post a question in reference to this problem, but I did such a poor job of demonstrating the output, I deleted it and tried again using ipython notebook. PS 我确实发布了一个关于这个问题的问题,但是我在演示输出方面做得很差,我删除了它并使用 ipython notebook 再次尝试。 I apologize in advance if you are seeing this twice.
如果您看到两次,我提前道歉。
Here is the notebook for your full reference 这是笔记本供您完整参考
EDIT - My desired results would look like this (made in excel):编辑 - 我想要的结果看起来像这样(用 excel 制作):
StructId resNum pdb_residue_number chain_id name3 fa_dun fa_dun_dev fa_dun_rot fa_dun_semi omega p_aa_pp rama ref
4294967297 1 99 A ASN 2.1856 0.0000 2.1856 0.0648 -1.1912
4294967297 2 100 A MET 1.3789 0.0286 1.3504 0.2223 -0.4425 -0.7952 0.2495
4294967297 3 101 A VAL 0.0204 0.0255 -0.0052 0.0051 -0.0968 0.2674 0.9796
4294967297 4 102 A GLU 4.2180 0.0037 4.2143 0.2122 -0.4628 -1.4033 -1.9609
4294967297 5 103 A GLN 3.6630 0.0050 3.6581 0.0619 -0.2759 -0.6001 -1.5172
4294967297 6 104 A MET 1.5175 0.2206 1.2968 0.0504 -0.3758 -0.7419 0.2495
4294967297 7 105 A HIS 3.6987 0.0184 3.6804 0.0547 0.4019 -0.1489 0.3883
4294967297 8 106 A THR 0.1048 0.0134 0.0914 0.0003 -0.7963 -0.4033 0.2013
4294967297 9 107 A ASP 2.3626 0.0005 2.3620 0.0521 0.1955 -0.3499 -1.6300
4294967297 10 108 A ILE 1.8447 0.0270 1.8176 0.0971 0.1676 -0.4071 1.0806
4294967297 11 109 A ILE 0.1276 0.0092 0.1183 0.0208 -0.4026 -0.0075 1.0806
4294967297 12 110 A SER 0.2921 0.0342 0.2578 0.0342 -0.2426 -1.3930 0.1654
4294967297 13 111 A LEU 0.6483 0.0019 0.6464 0.0845 -0.3565 -0.2356 0.7611
4294967297 14 112 A TRP 2.5965 0.1507 2.4457 0.5143 -0.1370 -0.5373 1.2341
4294967297 15 113 A ASP 2.6448 0.1593 0.0510 -0.5011
For anyone who is still interested in the difference between pivot
and pivot_table
, there are mainly two differences:对于仍然对
pivot
和pivot_table
之间的区别感兴趣的任何人,主要有两个区别:
pivot_table
is a generalization of pivot
that can handle duplicate values for one pivoted index/column pair. pivot_table
是pivot
一种推广,可以处理一个旋转索引/列对的重复值。 Specifically, you can give pivot_table
a list of aggregation functions using keyword argument aggfunc
.aggfunc
为pivot_table
提供聚合函数列表。 The default aggfunc
of pivot_table
is numpy.mean
.aggfunc
的pivot_table
是numpy.mean
。pivot_table
also supports using multiple columns for the index and column of the pivoted table. pivot_table
还支持使用多列的枢转表的索引和列。 A hierarchical index will be automatically generated for you. REF: pivot
and pivot_table
REF:
pivot
和pivot_table
Another caveat:另一个警告:
pivot_table
will only allow numerical types as "values=", whereas pivot
will take string types as "values=". pivot_table
只允许数字类型作为“values=”,而pivot
将字符串类型作为“values=”。
I debugged it a little bit.我调试了一下。
DataFrame.pivot()
and DataFrame.pivot_table()
are different. DataFrame.pivot()
和DataFrame.pivot_table()
是不同的。pivot()
doesn't accept a list for index. pivot()
不接受索引列表。pivot_table()
accepts. pivot_table()
接受。 Internally, both of them are using reset_index()
/ stack()
/ unstack()
to do the job.在内部,他们都使用
reset_index()
/ stack()
/ unstack()
来完成这项工作。
pivot()
is just a short cut for simple usage, I think.我认为,
pivot()
只是简单使用的捷径。
I'm not sure I understand, but I'll give it a try.我不确定我是否理解,但我会尝试一下。 I usually use stack/unstack instead of pivot, is this closer to what you want?
我通常使用堆栈/取消堆栈而不是枢轴,这是否更接近您想要的?
df.set_index(['struct_id','resNum','score_type_name']).unstack()
score_value
score_type_name fa_dun fa_dun_dev fa_dun_rot fa_dun_semi omega
struct_id resNum
4294967297 1 2.185618 0.000027 NaN 2.185591 0.064840
2 1.378923 0.028560 1.350362 NaN 0.222345
3 0.020352 0.025507 -0.005156 NaN 0.005106
4 4.218029 0.003712 NaN 4.214317 0.212160
5 3.663050 0.004953 NaN NaN 0.061867
score_type_name p_aa_pp rama ref
struct_id resNum
4294967297 1 NaN NaN -1.191180
2 -0.442467 -0.795161 0.249477
3 -0.096847 0.267443 0.979644
4 -0.462765 -1.403292 -1.960940
5 NaN -0.600053 NaN
I'm not sure why your pivot isn't working (kinda seems to me like it should, but I could be wrong), but it does seem to work (or at least not give an error) if I leave off 'struct_id'.我不确定为什么你的枢轴不起作用(在我看来它应该是这样,但我可能是错的),但如果我不使用“struct_id”,它似乎确实有效(或者至少不会给出错误) . Of course, that's not really a useful solution for the full dataset where you have more than one different values for 'struct_id'.
当然,对于“struct_id”有多个不同值的完整数据集,这并不是一个真正有用的解决方案。
df.pivot(columns='score_type_name',values='score_value',index='resNum')
score_type_name fa_dun fa_dun_dev fa_dun_rot fa_dun_semi omega
resNum
1 2.185618 0.000027 NaN 2.185591 0.064840
2 1.378923 0.028560 1.350362 NaN 0.222345
3 0.020352 0.025507 -0.005156 NaN 0.005106
4 4.218029 0.003712 NaN 4.214317 0.212160
5 3.663050 0.004953 NaN NaN 0.061867
score_type_name p_aa_pp rama ref
resNum
1 NaN NaN -1.191180
2 -0.442467 -0.795161 0.249477
3 -0.096847 0.267443 0.979644
4 -0.462765 -1.403292 -1.960940
5 NaN -0.600053 NaN
Edit to add: reset_index()
will convert from a multi-index (hierarchical) to a flatter style.编辑添加:
reset_index()
将从多索引(分层)转换为更扁平的样式。 There is still some hierarchy in the column names, sometimes the easiest way to get rid of those is just to do df.columns=['var1','var2',...]
although there are more sophisticated ways if you do some searching.列名中仍然有一些层次结构,有时摆脱这些的最简单方法就是执行
df.columns=['var1','var2',...]
尽管如果您执行一些操作,还有更复杂的方法搜索。
df.set_index(['struct_id','resNum','score_type_name']).unstack().reset_index()
struct_id resNum score_value
score_type_name fa_dun fa_dun_dev fa_dun_rot
0 4294967297 1 2.185618 0.000027 NaN
1 4294967297 2 1.378923 0.028560 1.350362
2 4294967297 3 0.020352 0.025507 -0.005156
3 4294967297 4 4.218029 0.003712 NaN
4 4294967297 5 3.663050 0.004953 NaN
pivot()
is used for pivoting without aggregation. pivot()
用于没有聚合的旋转。 Therefore, it can't deal with duplicate values for one index/column pair.因此,它无法处理一对索引/列的重复值。
Since here your index=['struct_id','resNum']
have multiple duplicates, therefore pivot doesn't work.由于这里您的
index=['struct_id','resNum']
有多个重复项,因此数据透视不起作用。
However, pivot_table
will work because it will handle duplicate values by aggregating them.但是,
pivot_table
将起作用,因为它将通过聚合它们来处理重复值。
To get the dataframe you obtained from the pivot_table
call into the format you want:要将您从
pivot_table
调用获得的数据帧转换为您想要的格式:
pivoted.columns.name=None ## remove the score_type_name
result = pivoted.reset_index() ## puts index columns back into dataframe body
The given snippet may help you out for further flatten the look of your dataframe给定的代码段可以帮助您进一步扁平化数据框的外观
df.set_index(['struct_id','resNum','score_type_name']).unstack().reset_index()
df.loc[:,['struct_id','resNum','fa_dun','fa_dun_dev','fa_dun_rot']]
Before calling pivot we need to ensure that our data does not have rows with duplicate values for the specified columns .在调用 pivot 之前,我们需要确保我们的数据中没有指定列具有重复值的行。
Pivot with duplicate give枢轴重复给
Index contains duplicate entries, cannot reshape
If we can't ensure this we may have to use the pivot_table method instead.如果我们不能确保这一点,我们可能不得不使用pivot_table方法来代替。
Please find the link below for a more detailed explanation请找到下面的链接以获得更详细的解释
https://nikgrozev.com/2015/07/01/reshaping-in-pandas-pivot-pivot-table-stack-and-unstack-explained-with-pictures/ https://nikgrozev.com/2015/07/01/reshaping-in-pandas-pivot-pivot-table-stack-and-unstack-explained-with-pictures/
声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.