简体   繁体   中英

Calculating mean grades for deciles within a dataset with Python, grouped by another field

import pandas as pd  
import csv



df_orig = pd.read_csv('test_sample.csv')  
df_orig = df_orig[(df_orig['number']>0)]  
decile_stats = df_orig.groupby(pd.qcut(df_orig.number, 5))['number'].mean()  


print(decile_stats)

I'm trying to use python to calculate statistics for deciles of my dataset. I can calculate the mean of each decile using qcut, but I want to group my numbers by the values in a second column. This way the deciles are calculated and reported on values according to their value in the family column.

family number
0 1000 0.04
1 1000 0.20
2 1000 0.04
3 1000 0.16
4 1000 0.08
5 1000 0.02
6 1000 0.02
7 1000 0.02
8 1000 0.64
9 1000 0.04

My desired output would be:

Q1 1000 0.028617
Q2 1000 0.105060
Q3 1000 0.452467
Q4 1000 2.644886
Q5 1000 141.749797...

etc. with each 'family' shown, 1000, 2000, 3000 in this case.

IIUC, you can use:

N = 3
labels = [f'Q{i}' for i in range(1, N+1)]
decile = lambda x: x.groupby(pd.qcut(x['number'], N, labels=labels)).mean()

out = df.groupby('family').apply(decile)['number'].rename('mean').reset_index()

Output:

>>> out
   family number      mean
0    1000     Q1  0.030000
1    1000     Q2  0.080000
2    1000     Q3  0.333333

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM