Pandas fuzzy group summary statistics

Question

I have a data frame loaded from a CSV file and would like to calculate basic summary statistics (e.g. mean, variance, ...) for the training runs of each model.

Inserting a model number and grouping by that would work, but it does not seem like a good solution. How can I get the summary statistics per model (for training only)? Grouping by modelName does not work, because each name contains a run counter.

df.groupby(['modelName', 'typeOfRun'])['kappa'].mean()

or

df[df.typeOfRun != 'validation'].describe()

do not yield the desired results.

AUC_R,Accuracy,Error rate,False negative rate,False positive rate,Lift value,Precision J,Precision N,Rate of negative predictions,Rate of positive predictions,Sensitivity (true positives rate),Specificity (true negatives rate),f1_R,kappa,modelName,typeOfRun
0.7747622323007851,0.7182416731216111,0.28175832687838887,0.16519823788546256,0.28527729751296715,2.769918376242967,0.08117369886485329,0.9930703132218424,0.029305447973147433,0.3013813581203202,0.8348017621145375,0.7147227024870328,0.8312130234716368,0.09987857210248623,00_testing_1-training,training
0.7688154033277225,0.7295055512522592,0.27049444874774076,0.1894273127753304,0.27294188056922464,2.807689674786938,0.08228060368921185,0.9921956531603068,0.029305447973147433,0.28869739220242707,0.8105726872246696,0.7270581194307754,0.8391825769931881,0.10159217699431862,00_testing_2-training,training
0.7653761718477654,0.7217918925897238,0.2782081074102763,0.1883259911894273,0.2809216651150419,2.737743031677203,0.08023078597866318,0.9921552436003304,0.029305447973147433,0.29647560030983733,0.8116740088105727,0.7190783348849581,0.8338281219878937,0.09791120175612114,00_testing_3-training,training
0.7666987721022418,0.7202566535628756,0.2797433464371244,0.18396711202466598,0.2826353437708505,2.7358921138891255,0.08018987022168358,0.9923159476282464,0.02931031885891585,0.2982693958700465,0.816032887975334,0.7173646562291496,0.8327314318650539,0.097878484924986,00_testing-validation,validation
0.7776426005660843,0.7300542215336948,0.2699457784663052,0.17180616740088106,0.2729086314669504,2.8639238514789174,0.08392857142857142,0.9929168180167091,0.029305447973147433,0.28918151303898787,0.8281938325991189,0.7270913685330496,0.8394625719769673,0.10476961017159536,01_otherSet_1-training,training
0.7691501646636157,0.737412858249419,0.26258714175058095,0.197136563876652,0.2645631067961165,2.8639098209585327,0.08392816025788626,0.9919723742039644,0.029305447973147433,0.2803382390911438,0.802863436123348,0.7354368932038835,0.8446557452170924,0.1044486077353842,01_otherSet_2-training,training
0.770174515310113,0.7342176607281178,0.2657823392718823,0.19162995594713655,0.26802101343263735,2.847815513920855,0.08345650938032974,0.9921582766235522,0.029305447973147433,0.283856183836819,0.8083700440528634,0.7319789865673627,0.8424375777288816,0.10367514449353035,01_otherSet_3-training,training
0.7676347850606817,0.7317488289428102,0.26825117105718976,0.19424460431654678,0.2704858255620898,2.8156062097690264,0.08252631578947368,0.9920241385858671,0.02931031885891585,0.2861747473378218,0.8057553956834532,0.7295141744379102,0.8407546494992847,0.10196584743637081,01_otherSet-validation,validation

Answer 1

IIUC, you can use DataFrameGroupBy.describe:

print (df.groupby(['modelName', 'typeOfRun']).describe())

                                             f1_R     kappa  
modelName              typeOfRun                             
00_testing-validation  validation count  1.000000  1.000000  
                                  mean   0.832731  0.097878  
                                  std         NaN       NaN  
                                  min    0.832731  0.097878  
                                  25%    0.832731  0.097878  
                                  50%    0.832731  0.097878  
                                  75%    0.832731  0.097878  
                                  max    0.832731  0.097878  
00_testing_1-training  training   count  1.000000  1.000000  
                                  mean   0.831213  0.099879  
                                  std         NaN       NaN  
                                  min    0.831213  0.099879  
                                  25%    0.831213  0.099879  
                                  50%    0.831213  0.099879  
                                  75%    0.831213  0.099879  
                                  max    0.831213  0.099879  
00_testing_2-training  training   count  1.000000  1.000000  
                                  mean   0.839183  0.101592  
                                  std         NaN       NaN  
...
...

You can also group by a Series created by splitting modelName on '_' and selecting the first item of each resulting list with str[0]:

print (df.modelName.str.split('_').str[0])
0    00
1    00
2    00
3    00
4    01
5    01
6    01
7    01
Name: modelName, dtype: object
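
If the prefix before the first underscore is not specific enough (for example, if two different models shared the 00 prefix), one possible alternative is to strip only the run counter and the run-type suffix with a regular expression. This is a minimal sketch, not part of the original answer: it assumes every name looks like <model>[_<counter>]-<typeOfRun>, as in the sample data, and the variable name model and the pattern are illustrative.

```python
import pandas as pd

# Small stand-in for the modelName column from the question.
names = pd.Series([
    "00_testing_1-training",
    "00_testing_2-training",
    "00_testing-validation",
    "01_otherSet_1-training",
    "01_otherSet-validation",
], name="modelName")

# Drop an optional "_<digits>" counter plus the "-training"/"-validation" suffix.
# Assumption: the model name itself never ends in "_<digits>".
model = names.str.replace(r"(_\d+)?-(training|validation)$", "", regex=True)
print(model.tolist())
```

The resulting Series can be passed to groupby in the same way as the split-based key.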

print (df.groupby([df.modelName.str.split('_').str[0]]).describe())
                    AUC_R  Accuracy  Error rate  False negative rate  \
modelName                                                              
00        count  4.000000  4.000000    4.000000             4.000000   
          mean   0.768913  0.722449    0.277551             0.181730   
          std    0.004149  0.004924    0.004924             0.011270   
          min    0.765376  0.718242    0.270494             0.165198   
          25%    0.766368  0.719753    0.276280             0.179275   
          50%    0.767757  0.721024    0.278976             0.186147   
          75%    0.770302  0.723720    0.280247             0.188601   
          max    0.774762  0.729506    0.281758             0.189427   
01        count  4.000000  4.000000    4.000000             4.000000   
          mean   0.771151  0.733358    0.266642             0.188704   
          std    0.004452  0.003198    0.003198             0.011488   
          min    0.767635  0.730054    0.262587             0.171806   
          25%    0.768771  0.731325    0.264984             0.186674   
          50%    0.769662  0.732983    0.267017             0.192937   
          75%    0.772042  0.735016    0.268675             0.194968   
          max    0.777643  0.737413    0.269946             0.197137   
          ...
          ...
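
Since the question asks for training runs only, the grouping above can be combined with a filter on typeOfRun. The following is a hedged sketch rather than the answerer's code: the CSV is a reduced stand-in with made-up kappa values, and the names train, model and stats are illustrative.

```python
import io
import pandas as pd

# Reduced stand-in for the CSV in the question; kappa values are illustrative only.
csv = io.StringIO("""kappa,modelName,typeOfRun
0.10,00_testing_1-training,training
0.12,00_testing_2-training,training
0.14,00_testing_3-training,training
0.50,00_testing-validation,validation
0.20,01_otherSet_1-training,training
0.30,01_otherSet_2-training,training
0.40,01_otherSet-validation,validation
""")
df = pd.read_csv(csv)

# Keep the training rows only.
train = df[df.typeOfRun == "training"]

# Group key: the prefix before the first underscore ("00", "01").
model = train.modelName.str.split("_").str[0].rename("model")

# Per-model count, mean and (sample) variance of kappa.
stats = train.groupby(model)["kappa"].agg(["count", "mean", "var"])
print(stats)
```

Filtering before grouping keeps the validation rows out of every statistic. To summarise all numeric columns at once, describe() can be used on the same grouped object in place of agg.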

solution1
1 ACCEPTED 2016-10-06 07:50:54
