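How do I aggregate data in a Pandas pivot table?

Question

I am working on a spatial alignment task and exploring how different scoring/rescoring functions affect alignment quality, as measured by RMSD. I have long-format data in which every scoring/rescoring combination was run on each system and repeated 3 times.

Here is some sample test data: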

   identifier score rescore  rmsd  repeat
0        1abc   plp     asp   1.2       1
1        1abc   plp     asp   1.3       2
2        1abc   plp     asp   1.5       3
3        1abc   plp     plp   3.2       1
4        1abc   plp     plp   3.3       2
5        1abc   plp     plp   3.5       3
6        1abc   asp     asp   5.2       1
7        1abc   asp     asp   5.3       2
8        1abc   asp     asp   5.5       3
9        1abc   asp     plp   1.2       1
10       1abc   asp     plp   1.3       2
11       1abc   asp     plp   1.5       3
12       2def   plp     asp   1.0       1
13       2def   plp     asp   1.1       2
14       2def   plp     asp   1.2       3
15       2def   plp     plp   3.0       1
16       2def   plp     plp   3.1       2
17       2def   plp     plp   3.2       3
18       2def   asp     asp   5.0       1
19       2def   asp     asp   5.1       2
20       2def   asp     asp   5.2       3
21       2def   asp     plp   1.0       1
22       2def   asp     plp   1.3       2
23       2def   asp     plp   1.7       3
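
For reproducibility, the table above can be rebuilt in memory. The following is only a sketch: the dictionary name `rmsd_by_run` and the row-building loop are illustrative choices, not part of the original post, and the values are copied from the table.

```python
import pandas as pd

# RMSD values for repeats 1..3, keyed by (identifier, score, rescore).
# The name `rmsd_by_run` is hypothetical; the numbers come from the table above.
rmsd_by_run = {
    ("1abc", "plp", "asp"): [1.2, 1.3, 1.5],
    ("1abc", "plp", "plp"): [3.2, 3.3, 3.5],
    ("1abc", "asp", "asp"): [5.2, 5.3, 5.5],
    ("1abc", "asp", "plp"): [1.2, 1.3, 1.5],
    ("2def", "plp", "asp"): [1.0, 1.1, 1.2],
    ("2def", "plp", "plp"): [3.0, 3.1, 3.2],
    ("2def", "asp", "asp"): [5.0, 5.1, 5.2],
    ("2def", "asp", "plp"): [1.0, 1.3, 1.7],
}

# Expand to long format: one row per (identifier, score, rescore, repeat).
rows = [
    {"identifier": ident, "score": score, "rescore": rescore,
     "rmsd": value, "repeat": repeat}
    for (ident, score, rescore), values in rmsd_by_run.items()
    for repeat, value in enumerate(values, start=1)
]
df = pd.DataFrame(rows)
print(df)
```

Calling `df.to_csv("test.csv")` on this frame should give a file that can be read back with `pd.read_csv("test.csv", index_col=0)`.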

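For this task, an RMSD <= 1.5 counts as a success. I want to compute the success rate, as a percentage of all systems in my dataset, for each score/rescore combination. I want to report the result as the mean and standard deviation of that success rate over the 3 repeats.

Desired output: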

rescore    asp           plp             
vals      mean     sd   mean     sd     
score                                           
asp        0.0    0.0   83.3   28.9  
plp      100.0    0.0    0.0    0.0

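My attempt so far:

import pandas as pd

df = pd.read_csv("test.csv", index_col=0)
systems = len(set(df.identifier))  # number of distinct systems

# success rate (%) for each score/rescore/repeat combination
pd.pivot_table(df, 
               index='score', 
               columns= ['rescore', 'repeat'], 
               values='rmsd', 
               aggfunc=lambda x:((x <= 1.5).sum()/systems)*100)

Out: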

rescore    asp                  plp             
repeat       1      2      3      1      2     3
score                                           
asp        0.0    0.0    0.0  100.0  100.0  50.0
plp      100.0  100.0  100.0    0.0    0.0   0.0

So my question is:

How do I aggregate over the "repeat" column to get the mean and standard deviation?

Answer 1

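You can .melt() the pivot table back to long format and then pivot it again.

# df is the DataFrame loaded in the question
systems = len(set(df.identifier))

(
    pd.pivot_table(df,
                   index='score',
                   columns=['rescore', 'repeat'],
                   values='rmsd',
                   aggfunc=lambda x: ((x <= 1.5).sum() / systems) * 100)
    # back to long format: one row per (score, rescore, repeat)
    .melt(ignore_index=False)
    .reset_index()
    # pivot again, aggregating the three repeats in each cell
    .pivot_table(index='score',
                 columns='rescore',
                 values='value',
                 aggfunc=['mean', 'std'])
)

Output: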

           mean          std       
rescore     asp    plp   asp    plp
score                              
asp       0.000 83.333 0.000 28.868
plp     100.000  0.000 0.000  0.000
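
As a sanity check on the 83.333 and 28.868 entries, the arithmetic can be reproduced by hand. This sketch assumes the three per-repeat rates for score=asp, rescore=plp (100, 100, 50) read off the intermediate pivot table in the question. It also shows that pandas `std` defaults to the sample standard deviation (`ddof=1`), which is what the desired output uses.

```python
import pandas as pd

# Per-repeat success rates (%) for score=asp, rescore=plp,
# taken from the intermediate pivot table in the question.
rates = pd.Series([100.0, 100.0, 50.0])

mean = rates.mean()
sample_std = rates.std()             # pandas default, ddof=1
population_std = rates.std(ddof=0)   # shown only for comparison

print(round(mean, 3), round(sample_std, 3), round(population_std, 3))
```

If the population standard deviation were wanted instead, `aggfunc` could be given a callable such as `lambda s: s.std(ddof=0)` in place of the string `'std'`.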

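Alternatively, move repeat into the index argument of pd.pivot_table() and then use the .groupby() method.

systems = len(set(df.identifier))

(
    pd.pivot_table(df,
                   index=['score', 'repeat'],
                   columns='rescore',
                   values='rmsd',
                   aggfunc=lambda x: ((x <= 1.5).sum() / systems) * 100)
    .reset_index()
    # aggregate the per-repeat rates within each score
    .groupby('score')[list(df['rescore'].unique())].agg(['mean', 'std'])
    # put the statistic on the outer column level, as in the output above;
    # omit the next two calls to keep rescore on the outer level instead
    .swaplevel(0, 1, 1)
    .sort_index(axis=1)
)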
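
A further variant, not from the original answer, skips pivot_table altogether and uses two groupby steps. This is a sketch under one assumption: every system appears exactly once in each (score, rescore, repeat) group, so the mean of the boolean success flag equals the fraction of systems that succeeded. The names `rmsd_by_run`, `per_repeat` and `summary` are illustrative.

```python
import pandas as pd

# Sample data from the question: RMSD for repeats 1..3 per (identifier, score, rescore).
rmsd_by_run = {
    ("1abc", "plp", "asp"): [1.2, 1.3, 1.5],
    ("1abc", "plp", "plp"): [3.2, 3.3, 3.5],
    ("1abc", "asp", "asp"): [5.2, 5.3, 5.5],
    ("1abc", "asp", "plp"): [1.2, 1.3, 1.5],
    ("2def", "plp", "asp"): [1.0, 1.1, 1.2],
    ("2def", "plp", "plp"): [3.0, 3.1, 3.2],
    ("2def", "asp", "asp"): [5.0, 5.1, 5.2],
    ("2def", "asp", "plp"): [1.0, 1.3, 1.7],
}
df = pd.DataFrame(
    [
        {"identifier": i, "score": s, "rescore": r, "rmsd": v, "repeat": n}
        for (i, s, r), values in rmsd_by_run.items()
        for n, v in enumerate(values, start=1)
    ]
)

# Step 1: success rate (%) per (score, rescore, repeat).
# Assumes each system occurs once per group, so mean(flag) is the success fraction.
per_repeat = (
    df.assign(success=df["rmsd"] <= 1.5)
      .groupby(["score", "rescore", "repeat"])["success"]
      .mean()
      .mul(100)
)

# Step 2: mean and sample standard deviation over the repeats,
# then spread rescore across the columns with rescore as the outer level.
summary = (
    per_repeat.groupby(level=["score", "rescore"])
              .agg(["mean", "std"])
              .unstack("rescore")
              .swaplevel(axis=1)
              .sort_index(axis=1)
)
print(summary.round(1))
```

Unlike the lambda that divides by `systems`, this version divides by the number of rows in each group, so the two only agree when no system is missing from a group.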

Answer 1
2 · Accepted · 2021-06-02 13:28:50
