
Pandas How to Not Include Grouping Index in Groupby-Apply

In Pandas, if I perform a groupby-apply operation and look at the result, the grouping index is retained.

>>> import numpy as np
>>> import pandas as pd
>>> df = pd.DataFrame({'gender':['M','M','F','F','F','M'],'age':[10,10,20,20,30,30],'income':[10000,15000,20000,25000,30000,35000],'education':[0,1,2,2,2,3]})
>>> df
   age  education gender  income
0   10          0      M   10000
1   10          1      M   15000
2   20          2      F   20000
3   20          2      F   25000
4   30          2      F   30000
5   30          3      M   35000
>>> df.groupby(['age','education']).apply(lambda x:x.iloc[np.argmax(x['income'].values),:])
                age  education gender  income
age education
10  0           10          0      M   10000
    1           10          1      M   15000
20  2           20          2      F   25000
30  2           30          2      F   30000
    3           30          3      M   35000

You can see here that ['age','education'] appear both in the index and as columns in the result. To me this is redundant and clumsy to work with. Is there a way to not include the grouping index in the 'groupby' object? For example, to get something like this:

                gender  income
age education
10  0           M       10000
    1           M       15000
20  2           F       25000
30  2           F       30000
    3           M       35000

PS: I know I can drop the index afterwards, but I want to know whether there is a cleaner way, and whether there is any reason to retain the grouping index in the grouped object. I come to Python from the R world, and in R's data.table you can do the same operation concisely with dt[,.SD[which.max(income)],by=.(age,education)]

The groupby syntax can be a little clunky, I admit. But it's a little cleaner if you find the row labels of the maximal income in each group and use those to index into df:

In [46]: df.groupby(['age','education'])['income'].idxmax()
Out[46]: 
age  education
10   0            0
     1            1
20   2            3
30   2            4
     3            5
Name: income, dtype: int64

In [47]: df.loc[df.groupby(['age','education'])['income'].idxmax()]
Out[47]: 
   age  education gender  income
0   10          0      M   10000
1   10          1      M   15000
3   20          2      F   25000
4   30          2      F   30000
5   30          3      M   35000
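
If you want the layout asked for in the question (grouping keys in the index only, with just gender and income as columns), one possible follow-up is to move the key columns into the index after the lookup. This is only a sketch built on the idxmax approach above; the names keys, top_rows and result are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'gender': ['M', 'M', 'F', 'F', 'F', 'M'],
    'age': [10, 10, 20, 20, 30, 30],
    'income': [10000, 15000, 20000, 25000, 30000, 35000],
    'education': [0, 1, 2, 2, 2, 3],
})

# Hypothetical helper name for the grouping columns.
keys = ['age', 'education']

# Row labels of the highest income within each (age, education) group.
top_rows = df.groupby(keys)['income'].idxmax()

# Select those rows, then move the grouping columns into the index
# so that they appear only once in the result.
result = df.loc[top_rows].set_index(keys)
print(result)
```

Note that idxmax returns the first matching label when several rows in a group share the maximal income, so ties are resolved by row order.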


 