I am carrying out a spatial alignment task in which I am exploring the effect of different score/rescore functions on the quality of the alignment (measured by RMSD). I have long-form data in which every scoring/rescoring combination has been run for several systems, with 3 repeats each.
Here's some sample test data:
    identifier score rescore  rmsd  repeat
0         1abc   plp     asp   1.2       1
1         1abc   plp     asp   1.3       2
2         1abc   plp     asp   1.5       3
3         1abc   plp     plp   3.2       1
4         1abc   plp     plp   3.3       2
5         1abc   plp     plp   3.5       3
6         1abc   asp     asp   5.2       1
7         1abc   asp     asp   5.3       2
8         1abc   asp     asp   5.5       3
9         1abc   asp     plp   1.2       1
10        1abc   asp     plp   1.3       2
11        1abc   asp     plp   1.5       3
12        2def   plp     asp   1.0       1
13        2def   plp     asp   1.1       2
14        2def   plp     asp   1.2       3
15        2def   plp     plp   3.0       1
16        2def   plp     plp   3.1       2
17        2def   plp     plp   3.2       3
18        2def   asp     asp   5.0       1
19        2def   asp     asp   5.1       2
20        2def   asp     asp   5.2       3
21        2def   asp     plp   1.0       1
22        2def   asp     plp   1.3       2
23        2def   asp     plp   1.7       3
For this particular task, RMSD <= 1.5 is considered a success. I want to calculate the percentage success rate across all systems in my dataset, split by score and rescore combination. I want to report the result as the mean and standard deviation (of the percentage success rate) across the 3 repeats.
desired output:
rescore     asp           plp
vals       mean   sd     mean    sd
score
asp         0.0  0.0     83.3  28.9
plp       100.0  0.0      0.0   0.0
My attempt so far:
import pandas as pd

df = pd.read_csv("test.csv", index_col=0)
systems = len(set(df.identifier))  # number of distinct systems
pd.pivot_table(df,
               index='score',
               columns=['rescore', 'repeat'],
               values='rmsd',
               aggfunc=lambda x: ((x <= 1.5).sum() / systems) * 100)
out:
rescore    asp                  plp
repeat       1      2      3      1      2     3
score
asp        0.0    0.0    0.0  100.0  100.0  50.0
plp      100.0  100.0  100.0    0.0    0.0   0.0
So my question is:
How do I aggregate across my 'repeat' column to yield the mean and standard deviation?
You can .melt() the pivoted table back into long form and pivot it again, this time aggregating the per-repeat success rates with mean and std.
systems = len(set(df.identifier))
pd.pivot_table(df,
               index='score',
               columns=['rescore', 'repeat'],
               values='rmsd',
               aggfunc=lambda x: ((x <= 1.5).sum() / systems) * 100
               ).melt(ignore_index=False)\
                .reset_index()\
                .pivot_table(index='score',
                             columns='rescore',
                             values='value',
                             aggfunc=['mean', 'std'])
output:
            mean              std
rescore      asp     plp      asp     plp
score
asp        0.000  83.333    0.000  28.868
plp      100.000   0.000    0.000   0.000
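The 28.868 in the std column is a sample standard deviation: pandas' 'std' aggregation divides by n - 1 by default (ddof=1). As a quick sanity check, here is a minimal sketch using only the standard library on the three per-repeat rates for score=asp, rescore=plp (100, 100, 50), taken from the intermediate table in the question. The variable names are illustrative only.

```python
import statistics

# Per-repeat success rates (%) for score=asp, rescore=plp,
# copied from the intermediate pivot table in the question.
rates = [100.0, 100.0, 50.0]

mean = statistics.mean(rates)
sample_sd = statistics.stdev(rates)       # divides by n - 1, like pandas' default 'std'
population_sd = statistics.pstdev(rates)  # divides by n, shown only for comparison

print(round(mean, 3), round(sample_sd, 3), round(population_sd, 3))
```

The sample value rounds to 28.868 and matches the table. If the population figure were wanted instead, one option would be to pass a function such as lambda x: x.std(ddof=0) as the aggfunc.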
Or, you could move repeat from the columns argument to the index argument of pd.pivot_table() and then aggregate over the repeats with .groupby().
systems = len(set(df.identifier))
pd.pivot_table(df,
               index=['score', 'repeat'],
               columns='rescore',
               values='rmsd',
               aggfunc=lambda x: ((x <= 1.5).sum() / systems) * 100
               ).reset_index()\
                .groupby('score')[df['rescore'].unique()].agg(['mean', 'std'])\
                .swaplevel(0, 1, axis=1)\
                .sort_index(axis=1)
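For comparison, the same two-step idea (success rate per repeat first, then mean and std across repeats) can be sketched with groupby alone, without an intermediate pivot. This is only a sketch under one assumption: every system appears exactly once in each (score, rescore, repeat) group, so the mean of the boolean success column equals successes divided by the number of systems. The sample frame is rebuilt inline from the question's data, and the names THRESHOLD, per_repeat and summary are illustrative.

```python
import pandas as pd

# Sample data from the question: one list of three RMSD values (repeats 1-3)
# per (identifier, score, rescore) combination.
rows = [
    ("1abc", "plp", "asp", [1.2, 1.3, 1.5]),
    ("1abc", "plp", "plp", [3.2, 3.3, 3.5]),
    ("1abc", "asp", "asp", [5.2, 5.3, 5.5]),
    ("1abc", "asp", "plp", [1.2, 1.3, 1.5]),
    ("2def", "plp", "asp", [1.0, 1.1, 1.2]),
    ("2def", "plp", "plp", [3.0, 3.1, 3.2]),
    ("2def", "asp", "asp", [5.0, 5.1, 5.2]),
    ("2def", "asp", "plp", [1.0, 1.3, 1.7]),
]
df = pd.DataFrame(
    [(ident, score, rescore, rmsd, rep)
     for ident, score, rescore, rmsds in rows
     for rep, rmsd in enumerate(rmsds, start=1)],
    columns=["identifier", "score", "rescore", "rmsd", "repeat"],
)

THRESHOLD = 1.5  # RMSD <= 1.5 counts as a success (from the question)

# Step 1: percentage success rate per (score, rescore, repeat).
# ASSUMPTION: each group contains every system exactly once, so the mean of
# the boolean column is the fraction of successful systems.
per_repeat = (
    df.assign(success=df["rmsd"] <= THRESHOLD)
      .groupby(["score", "rescore", "repeat"])["success"]
      .mean()
      .mul(100)
)

# Step 2: mean and sample standard deviation across the repeats,
# then move rescore into the columns to match the desired layout.
summary = (
    per_repeat.groupby(level=["score", "rescore"])
              .agg(["mean", "std"])
              .unstack("rescore")
              .swaplevel(0, 1, axis=1)
              .sort_index(axis=1)
)
print(summary.round(1))
```

If some systems are missing from some groups, dividing by the total number of systems (as the answers above do) and taking the group mean will give different rates, so the choice between them depends on how missing runs should be counted.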