
How to aggregate data in a Pandas Pivot Table?

I am carrying out a spatial alignment task in which I am exploring the effect of different score/rescore functions on the quality of the alignment (measured by RMSD). I have long-form data in which I have run all scoring/rescoring combinations for different systems, with each combination repeated 3 times.

Here's some sample test data:

   identifier score rescore  rmsd  repeat
0        1abc   plp     asp   1.2       1
1        1abc   plp     asp   1.3       2
2        1abc   plp     asp   1.5       3
3        1abc   plp     plp   3.2       1
4        1abc   plp     plp   3.3       2
5        1abc   plp     plp   3.5       3
6        1abc   asp     asp   5.2       1
7        1abc   asp     asp   5.3       2
8        1abc   asp     asp   5.5       3
9        1abc   asp     plp   1.2       1
10       1abc   asp     plp   1.3       2
11       1abc   asp     plp   1.5       3
12       2def   plp     asp   1.0       1
13       2def   plp     asp   1.1       2
14       2def   plp     asp   1.2       3
15       2def   plp     plp   3.0       1
16       2def   plp     plp   3.1       2
17       2def   plp     plp   3.2       3
18       2def   asp     asp   5.0       1
19       2def   asp     asp   5.1       2
20       2def   asp     asp   5.2       3
21       2def   asp     plp   1.0       1
22       2def   asp     plp   1.3       2
23       2def   asp     plp   1.7       3

For this particular task, RMSD <= 1.5 is considered a success. I want to calculate the percentage success rate across all systems in my dataset, split by score and rescore combination. I want to report the result as the mean and standard deviation (of the percentage success rate) across the 3 repeats.

Desired output:

rescore    asp           plp             
vals      mean     sd   mean     sd     
score                                           
asp        0.0    0.0   83.3   28.9  
plp      100.0    0.0    0.0    0.0

My attempt so far:

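import pandas as pd

df = pd.read_csv("test.csv", index_col=0)
systems = len(set(df.identifier))  # number of distinct systems

# Percentage of systems with RMSD <= 1.5 for each score/rescore/repeat
pd.pivot_table(df, 
               index='score', 
               columns=['rescore', 'repeat'], 
               values='rmsd', 
               aggfunc=lambda x: ((x <= 1.5).sum() / systems) * 100)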

Output:

rescore    asp                  plp             
repeat       1      2      3      1      2     3
score                                           
asp        0.0    0.0    0.0  100.0  100.0  50.0
plp      100.0  100.0  100.0    0.0    0.0   0.0

So my question is:

How do I aggregate across my 'repeat' column to yield the mean and standard deviation?

You can .melt() the pivoted table back to long form and pivot it again, this time aggregating across the repeats with mean and std.

systems = len(set(df.identifier))

pd.pivot_table(df, 
               index='score', 
               columns= ['rescore', 'repeat'], 
               values='rmsd', 
               aggfunc=lambda x:((x <= 1.5).sum()/systems)*100
).melt(ignore_index=False)\
    .reset_index()\
    .pivot_table(index='score',
                 columns='rescore', 
                 values='value', 
                 aggfunc=['mean', 'std'])

Output:

           mean          std       
rescore     asp    plp   asp    plp
score                              
asp       0.000 83.333 0.000 28.868
plp     100.000  0.000 0.000  0.000
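To see what the .melt() step actually produces, here is a minimal sketch that rebuilds the sample data inline (rather than reading test.csv) and stops after the melt. The names rmsd_by_group, pivoted and long_form are my own and not part of the original code.

```python
import pandas as pd

# Sample data from the question; rmsd values are listed in repeat order 1, 2, 3.
rmsd_by_group = {
    ("1abc", "plp", "asp"): [1.2, 1.3, 1.5],
    ("1abc", "plp", "plp"): [3.2, 3.3, 3.5],
    ("1abc", "asp", "asp"): [5.2, 5.3, 5.5],
    ("1abc", "asp", "plp"): [1.2, 1.3, 1.5],
    ("2def", "plp", "asp"): [1.0, 1.1, 1.2],
    ("2def", "plp", "plp"): [3.0, 3.1, 3.2],
    ("2def", "asp", "asp"): [5.0, 5.1, 5.2],
    ("2def", "asp", "plp"): [1.0, 1.3, 1.7],
}
rows = [
    (identifier, score, rescore, rmsd, repeat)
    for (identifier, score, rescore), values in rmsd_by_group.items()
    for repeat, rmsd in enumerate(values, start=1)
]
df = pd.DataFrame(rows, columns=["identifier", "score", "rescore", "rmsd", "repeat"])

systems = len(set(df.identifier))

# First pivot: percentage success per score x (rescore, repeat).
pivoted = pd.pivot_table(df,
                         index="score",
                         columns=["rescore", "repeat"],
                         values="rmsd",
                         aggfunc=lambda x: ((x <= 1.5).sum() / systems) * 100)

# Melt back to long form: one row per score/rescore/repeat,
# with the percentage in a column called 'value'.
long_form = pivoted.melt(ignore_index=False).reset_index()
print(long_form)
```

Each row of long_form holds one success percentage, so the second pivot_table only has to average 'value' over the repeats. Note that ignore_index is only available in pandas 1.1 and later.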

Or you could move repeat from the columns argument to the index argument of pd.pivot_table() and then aggregate with .groupby().

systems = len(set(df.identifier))

pd.pivot_table(df, 
               index=['score', 'repeat'], 
               columns= 'rescore', 
               values='rmsd', 
               aggfunc=lambda x:((x <= 1.5).sum()/systems)*100
).reset_index()\
    .groupby('score')[df['rescore'].unique()].agg(['mean', 'std'])\
    .swaplevel(0, 1, axis=1)\
    .sort_index(axis=1)
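
As a cross-check, the same numbers can be obtained with groupby alone, without pivot_table. This is only a sketch under one assumption: every system appears exactly once in each score/rescore/repeat group (true for the sample data), so the mean of the success flag equals the success count divided by the number of systems. The names success, per_repeat and result are my own.

```python
import pandas as pd

# Sample data from the question; rmsd values are listed in repeat order 1, 2, 3.
rmsd_by_group = {
    ("1abc", "plp", "asp"): [1.2, 1.3, 1.5],
    ("1abc", "plp", "plp"): [3.2, 3.3, 3.5],
    ("1abc", "asp", "asp"): [5.2, 5.3, 5.5],
    ("1abc", "asp", "plp"): [1.2, 1.3, 1.5],
    ("2def", "plp", "asp"): [1.0, 1.1, 1.2],
    ("2def", "plp", "plp"): [3.0, 3.1, 3.2],
    ("2def", "asp", "asp"): [5.0, 5.1, 5.2],
    ("2def", "asp", "plp"): [1.0, 1.3, 1.7],
}
rows = [
    (identifier, score, rescore, rmsd, repeat)
    for (identifier, score, rescore), values in rmsd_by_group.items()
    for repeat, rmsd in enumerate(values, start=1)
]
df = pd.DataFrame(rows, columns=["identifier", "score", "rescore", "rmsd", "repeat"])

# Flag each run as a success (True) or failure (False).
df["success"] = df["rmsd"] <= 1.5

# Percentage of systems that succeed within each score/rescore/repeat group.
# ASSUMPTION: every system is present in every group (see note above).
per_repeat = df.groupby(["score", "rescore", "repeat"])["success"].mean().mul(100)

# Mean and sample standard deviation (ddof=1) across the repeats,
# then move rescore into the columns to match the layout above.
result = (per_repeat
          .groupby(level=["score", "rescore"])
          .agg(["mean", "std"])
          .unstack("rescore"))
print(result.round(3))
```

If some systems are missing from some groups, dividing by the total number of systems (as in the pivot_table versions) and taking the group mean will give different percentages, so pick whichever denominator matches your definition of success rate.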
