
How to aggregate data in a Pandas Pivot Table?

I am carrying out a spatial alignment task in which I am exploring the effect of different score/rescore functions on the quality of the alignment (measured by RMSD). I have long-form data in which every score/rescore combination was run on each system, and each run was repeated 3 times.

Here's some sample test data:

   identifier score rescore  rmsd  repeat
0        1abc   plp     asp   1.2       1
1        1abc   plp     asp   1.3       2
2        1abc   plp     asp   1.5       3
3        1abc   plp     plp   3.2       1
4        1abc   plp     plp   3.3       2
5        1abc   plp     plp   3.5       3
6        1abc   asp     asp   5.2       1
7        1abc   asp     asp   5.3       2
8        1abc   asp     asp   5.5       3
9        1abc   asp     plp   1.2       1
10       1abc   asp     plp   1.3       2
11       1abc   asp     plp   1.5       3
12       2def   plp     asp   1.0       1
13       2def   plp     asp   1.1       2
14       2def   plp     asp   1.2       3
15       2def   plp     plp   3.0       1
16       2def   plp     plp   3.1       2
17       2def   plp     plp   3.2       3
18       2def   asp     asp   5.0       1
19       2def   asp     asp   5.1       2
20       2def   asp     asp   5.2       3
21       2def   asp     plp   1.0       1
22       2def   asp     plp   1.3       2
23       2def   asp     plp   1.7       3

For this particular task, RMSD <= 1.5 is considered a success. I want to calculate the percentage success rate over all systems in my dataset, split by score and rescore combination. I want to report the result as the mean and standard deviation (of the percentage success rate) across the 3 repeats.

desired output:

rescore    asp           plp             
vals      mean     sd   mean     sd     
score                                           
asp        0.0    0.0   83.3   28.9  
plp      100.0    0.0    0.0    0.0

My attempt so far:

import pandas as pd

df = pd.read_csv("test.csv", index_col=0)
systems = len(set(df.identifier))

pd.pivot_table(df, 
               index='score', 
               columns= ['rescore', 'repeat'], 
               values='rmsd', 
               aggfunc=lambda x:((x <= 1.5).sum()/systems)*100)

out:

rescore    asp                  plp             
repeat       1      2      3      1      2     3
score                                           
asp        0.0    0.0    0.0  100.0  100.0  50.0
plp      100.0  100.0  100.0    0.0    0.0   0.0

So my question is:

How do I aggregate across my 'repeat' column to yield the mean and standard deviation?

You can .melt() the pivoted table and pivot it again.

systems = len(set(df.identifier))

pd.pivot_table(df, 
               index='score', 
               columns= ['rescore', 'repeat'], 
               values='rmsd', 
               aggfunc=lambda x:((x <= 1.5).sum()/systems)*100
).melt(ignore_index=False)\
    .reset_index()\
    .pivot_table(index='score',
                 columns='rescore', 
                 values='value', 
                 aggfunc=['mean', 'std'])

output:

           mean          std       
rescore     asp    plp   asp    plp
score                              
asp       0.000 83.333 0.000 28.868
plp     100.000  0.000 0.000  0.000
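As a quick sanity check on the asp/plp cell: the per-repeat success rates from the question are 100, 100 and 50, and the 'std' aggregation in pandas uses the sample standard deviation (ddof=1) by default. The sketch below recomputes that one cell by hand; the variable names are only illustrative.

```python
import pandas as pd

# Per-repeat success rates (%) for score='asp', rescore='plp',
# copied from the pivoted table in the question.
rates = pd.Series([100.0, 100.0, 50.0])

mean = rates.mean()
sample_sd = rates.std()            # ddof=1, what aggfunc='std' uses
population_sd = rates.std(ddof=0)  # shown only for comparison

print(round(mean, 3), round(sample_sd, 3), round(population_sd, 3))
```

The sample value (about 28.868) is the one that matches the desired output of 28.9. If the population standard deviation is wanted instead, a callable such as lambda s: s.std(ddof=0) can be passed in place of 'std'.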

Alternatively, move 'repeat' from the columns argument to the index argument of pd.pivot_table(), then aggregate over the repeats with .groupby().

systems = len(set(df.identifier))

pd.pivot_table(df, 
               index=['score', 'repeat'], 
               columns= 'rescore', 
               values='rmsd', 
               aggfunc=lambda x:((x <= 1.5).sum()/systems)*100
).reset_index()\
    .groupby('score')[df['rescore'].unique()].agg(['mean', 'std'])\
    .swaplevel(0,1,1)\
    .sort_index(axis=1)
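For comparison, the same table can probably be reached without pivot_table at all, using two groupby steps. This is a minimal sketch rather than part of the original answer: it rebuilds the sample data inline instead of reading test.csv, and the names rows, success, per_repeat, summary and result are made up for illustration. Like the code above, it divides by the number of systems, so it assumes every system appears in every score/rescore/repeat group.

```python
import pandas as pd

# Sample data from the question: (identifier, score, rescore, rmsd for repeats 1..3)
rows = [
    ("1abc", "plp", "asp", [1.2, 1.3, 1.5]),
    ("1abc", "plp", "plp", [3.2, 3.3, 3.5]),
    ("1abc", "asp", "asp", [5.2, 5.3, 5.5]),
    ("1abc", "asp", "plp", [1.2, 1.3, 1.5]),
    ("2def", "plp", "asp", [1.0, 1.1, 1.2]),
    ("2def", "plp", "plp", [3.0, 3.1, 3.2]),
    ("2def", "asp", "asp", [5.0, 5.1, 5.2]),
    ("2def", "asp", "plp", [1.0, 1.3, 1.7]),
]
df = pd.DataFrame(
    [
        (identifier, score, rescore, rmsd, repeat)
        for identifier, score, rescore, values in rows
        for repeat, rmsd in enumerate(values, start=1)
    ],
    columns=["identifier", "score", "rescore", "rmsd", "repeat"],
)

systems = df["identifier"].nunique()

# Step 1: percentage of systems with RMSD <= 1.5, per combination and repeat
df["success"] = df["rmsd"] <= 1.5
per_repeat = (
    df.groupby(["score", "rescore", "repeat"])["success"].sum() / systems * 100
)

# Step 2: mean and sample standard deviation across the repeats
summary = per_repeat.groupby(level=["score", "rescore"]).agg(["mean", "std"])

# Step 3: put rescore on the columns, with mean/std nested underneath
result = summary.unstack("rescore").swaplevel(axis=1).sort_index(axis=1)
print(result.round(1))
```

Keeping the per-repeat rates in their own variable makes it easy to inspect the intermediate values before they are collapsed into mean and std.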

