Pandas Dataframe GroupBy - Displaying Group Statistics

Question

For the Pandas dataframe:

import random
import numpy as np
import pandas as pd
codes = ["one", "two", "three"]
colours = ["black", "white"]
textures = ["soft", "hard"]
N = 100  # length of the dataframe
df = pd.DataFrame({ 'id' : range(1,N+1),
                    'code' : [random.choice(codes) for i in range(1,N+1)],
                    'colour': [random.choice(colours) for i in range(1,N+1)],
                    'texture': [random.choice(textures) for i in range(1,N+1)],
                    'size': [random.randint(1,100) for i in range(1,N+1)]
                    },  columns= ['id','code','colour', 'texture', 'size'])

I run the line below to get the aggregated sizes grouped by code and colour pairs:

grouped = df.groupby(['code', 'colour']).agg({'size' : np.sum}).reset_index()
>> grouped
>>     code colour  size
>> 0    one  black   987
>> 1    one  white   972
>> 2  three  black   972
>> 3  three  white   488
>> 4    two  black  1162
>> 5    two  white  1158
>> [6 rows x 3 columns]

In addition to the aggregated (np.sum) sizes, I want to get separate columns for:

i. the average value (np.average) per group,

ii. the id of the row with the max size for a given group,

iii. how many times the group occurred (e.g. code=one, colour=black, 12 times).

Question: What is the fastest way to do this? Would I have to use apply() and a custom function?

Answer 1

You can pass a list of functions to be applied to each group, e.g.:

grouped = df.groupby(['code', 'colour'])['size'].agg(['sum', 'mean', 'size', 'idxmax']).reset_index()

Since idxmax returns the index label of the row with the maximum size in each group, you will need to look those labels up in the original dataframe to get the ids (df.loc is used here because df.ix was removed in pandas 1.0):

grouped['max_row_id'] = df.loc[grouped['idxmax'], 'id'].to_numpy()
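The lookup step can be checked on a small example. This is a minimal sketch, assuming a recent pandas version in which the string aggregator 'idxmax' and label lookups through df.loc are available; the six-row dataframe is made up for illustration and is not the data from the question.

```python
import pandas as pd

# Small made-up dataframe; the values are illustrative only.
df = pd.DataFrame({
    'id': [1, 2, 3, 4, 5, 6],
    'code': ['one', 'one', 'one', 'two', 'two', 'two'],
    'colour': ['black', 'black', 'white', 'black', 'white', 'white'],
    'size': [10, 30, 20, 5, 40, 15],
})

# Sum, mean, row count and index label of the largest 'size' per group.
grouped = (df.groupby(['code', 'colour'])['size']
             .agg(['sum', 'mean', 'size', 'idxmax'])
             .reset_index())

# 'idxmax' holds index labels of df, so map them back to the 'id' column.
# to_numpy() drops the looked-up index so values align by position.
grouped['max_row_id'] = df.loc[grouped['idxmax'], 'id'].to_numpy()
print(grouped)
```

The call to to_numpy() matters here: without it, the assignment would align on the index labels of df rather than on the row order of grouped.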

NOTE: I selected the 'size' column because all the functions apply to that column. If you want to apply a different set of functions to different columns, you can pass agg a dictionary that maps each column to a list of functions, e.g. agg({'size': ['sum', 'mean', 'idxmax']}). This results in MultiIndex columns, which means that when getting the ids for the maximum size in each group you need to select the column with a tuple key:

grouped['max_row_id'] = df.loc[grouped[('size', 'idxmax')], 'id'].to_numpy()
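For the dictionary form, the sketch below shows what the MultiIndex columns look like and one way to flatten them afterwards. It assumes a recent pandas version; the dataframe is made up, the flattened names such as size_sum are just an illustrative convention, and 'count' (non-null values per group) stands in for the group size.

```python
import pandas as pd

# Small made-up dataframe; the values are illustrative only.
df = pd.DataFrame({
    'id': [1, 2, 3, 4, 5, 6],
    'code': ['one', 'one', 'one', 'two', 'two', 'two'],
    'colour': ['black', 'black', 'white', 'black', 'white', 'white'],
    'size': [10, 30, 20, 5, 40, 15],
})

# A dict of lists produces two-level columns such as ('size', 'sum').
grouped = (df.groupby(['code', 'colour'])
             .agg({'size': ['sum', 'mean', 'count', 'idxmax']})
             .reset_index())

# Select the idxmax column with a tuple key, then look the labels up in df.
max_ids = df.loc[grouped[('size', 'idxmax')], 'id'].to_numpy()

# Optionally flatten the two levels into plain names like 'size_sum';
# empty second-level entries (for 'code' and 'colour') are skipped.
grouped.columns = ['_'.join(filter(None, col)) for col in grouped.columns]
grouped['max_row_id'] = max_ids
print(grouped)
```

Flattening before adding max_row_id keeps every column name a plain string, which is usually easier to work with downstream.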

solution1
3 ACCEPTED 2014-06-13 10:55:18
