For the Pandas dataframe:
import random
import numpy as np
import pandas as pd

codes = ["one", "two", "three"]
colours = ["black", "white"]
textures = ["soft", "hard"]
N = 100  # length of the dataframe
df = pd.DataFrame({'id': range(1, N + 1),
                   'code': [random.choice(codes) for _ in range(N)],
                   'colour': [random.choice(colours) for _ in range(N)],
                   'texture': [random.choice(textures) for _ in range(N)],
                   'size': [random.randint(1, 100) for _ in range(N)]
                   }, columns=['id', 'code', 'colour', 'texture', 'size'])
I run the line below to get the aggregated sizes grouped by code and colour pairs:
grouped = df.groupby(['code', 'colour']).agg({'size' : np.sum}).reset_index()
>> grouped
>> code colour size
>> 0 one black 987
>> 1 one white 972
>> 2 three black 972
>> 3 three white 488
>> 4 two black 1162
>> 5 two white 1158
>> [6 rows x 3 columns]
In addition to the aggregated (np.sum) sizes, I want to get separate columns for:
i. the average value (np.average) per group,
ii. the id of the row with the max size for a given group,
iii. how many times the group occurred (e.g. code=one, colour=black, 12 times).
Question: What is the fastest way to do this? Would I have to use apply() with a custom function?
You can pass a list of functions to be applied to each group, e.g.:
grouped = df.groupby(['code', 'colour'])['size'].agg(['sum', 'mean', 'size', 'idxmax']).reset_index()
Since idxmax returns the index label of the row with the maximum size (in current pandas, np.argmax would return a position within the group instead), you need to look the ids up in the original dataframe. Use .loc for this, as .ix has been removed from pandas:
grouped['max_row_id'] = df.loc[grouped['idxmax'], 'id'].values
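For reference, here is a minimal end-to-end sketch of that approach, assuming a reasonably recent pandas. The fixed random seed is my own addition so the run is reproducible, and I pass the aggregations as strings ('sum', 'mean', 'size', 'idxmax') rather than NumPy functions.

```python
import random

import pandas as pd

random.seed(0)  # assumption: fixed seed, only so the sketch is reproducible

codes = ["one", "two", "three"]
colours = ["black", "white"]
N = 100

df = pd.DataFrame({
    "id": range(1, N + 1),
    "code": [random.choice(codes) for _ in range(N)],
    "colour": [random.choice(colours) for _ in range(N)],
    "size": [random.randint(1, 100) for _ in range(N)],
})

# One pass over the groups: total, average, group count, and the index
# label of the row holding the maximum size.
grouped = (
    df.groupby(["code", "colour"])["size"]
      .agg(["sum", "mean", "size", "idxmax"])
      .reset_index()
)

# idxmax holds index labels of df, so .loc can map them to the id column.
grouped["max_row_id"] = df.loc[grouped["idxmax"], "id"].to_numpy()

print(grouped)
```

This avoids apply() entirely; the only extra step is the label lookup for the ids.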
NOTE: I selected the 'size' column because all the functions apply to that column. If you want a different set of functions for different columns, you can pass agg a dictionary that maps each column to a list of functions, e.g. agg({'size': ['sum', 'mean', 'idxmax']}). This results in MultiIndex columns, which means that to get the ids for the maximum size in each group you need to do:
grouped['max_row_id'] = df.loc[grouped['size']['idxmax'], 'id'].values
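To make the MultiIndex case concrete, here is a small sketch of the dictionary form. The five-row frame is hypothetical data I made up so the results are easy to verify by hand, and the extra 'count' on the id column is only there to show a second column getting its own function list.

```python
import pandas as pd

# Hypothetical hand-made data, chosen so the groups are easy to check.
df = pd.DataFrame({
    "id": [1, 2, 3, 4, 5],
    "code": ["one", "one", "two", "two", "two"],
    "colour": ["black", "black", "white", "white", "white"],
    "size": [10, 30, 5, 50, 20],
})

# A dict of lists gives two-level columns such as ('size', 'sum').
grouped = (
    df.groupby(["code", "colour"])
      .agg({"size": ["sum", "mean", "idxmax"], "id": ["count"]})
      .reset_index()
)

# Select with both levels to reach the idxmax labels, then look up the ids.
max_labels = grouped["size"]["idxmax"]
grouped["max_row_id"] = df.loc[max_labels, "id"].to_numpy()

print(grouped.columns.tolist())
print(grouped)
```

If the two-level columns get in the way, they can be flattened afterwards, but the lookup itself works the same either way.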