In pandas, if I perform a groupby-apply operation and look at each of the 'groupby' objects, the grouping index is retained.
>>> import numpy as np
>>> import pandas as pd
>>> df = pd.DataFrame({'gender':['M','M','F','F','F','M'],'age':[10,10,20,20,30,30],'income':[10000,15000,20000,25000,30000,35000],'education':[0,1,2,2,2,3]})
>>> df
age education gender income
0 10 0 M 10000
1 10 1 M 15000
2 20 2 F 20000
3 20 2 F 25000
4 30 2 F 30000
5 30 3 M 35000
>>> df.groupby(['age','education']).apply(lambda x:x.iloc[np.argmax(x['income'].values),:])
               age  education gender  income
age education
10  0           10          0      M   10000
    1           10          1      M   15000
20  2           20          2      F   25000
30  2           30          2      F   30000
    3           30          3      M   35000
You can see here that ['age','education'] appear in both the index and the values of the result. To me this is redundant and clumsy to work with. Is there a way to not include the grouping index in the 'groupby' object? For example, to get something like this:
              gender  income
age education
10  0              M   10000
    1              M   15000
20  2              F   25000
30  2              F   30000
    3              M   35000
PS: I know I can call dropindex(), but I just want to know if there is a cleaner way, and whether there is any reason to retain the grouping index in the grouped object. I come to Python from the R world, and in R's data.table you can do the same operation concisely with dt[,.SD[which.max(income)],by=.(age,education)]
The groupby syntax can be a little clunky, I admit. But it's a little cleaner if you find the row labels of the maximal incomes in each group and use them to index into df:
In [46]: df.groupby(['age','education'])['income'].idxmax()
Out[46]:
age  education
10   0            0
     1            1
20   2            3
30   2            4
     3            5
Name: income, dtype: int64
In [47]: df.loc[df.groupby(['age','education'])['income'].idxmax()]
Out[47]:
age education gender income
0 10 0 M 10000
1 10 1 M 15000
3 20 2 F 25000
4 30 2 F 30000
5 30 3 M 35000
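If you want the exact layout from the question (grouping keys in the index, only gender and income as columns), one possible follow-up is to move the key columns into the index after selecting the rows. This is a minimal sketch, not the only way to do it; the variable names keys, best_rows and result are just illustrative:

```python
import pandas as pd

df = pd.DataFrame({
    'gender': ['M', 'M', 'F', 'F', 'F', 'M'],
    'age': [10, 10, 20, 20, 30, 30],
    'income': [10000, 15000, 20000, 25000, 30000, 35000],
    'education': [0, 1, 2, 2, 2, 3],
})

keys = ['age', 'education']

# Row label of the maximal income within each (age, education) group.
best_rows = df.groupby(keys)['income'].idxmax()

# Select those rows, then move the grouping columns out of the values
# and into the index, so they appear only once.
result = df.loc[best_rows].set_index(keys)
print(result)
```

Because set_index drops the key columns from the values by default, the result has age and education only in the index, and no separate step is needed to remove the duplicates.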