python pandas groupby aggregate with min function

I have a dataframe like:

import pandas as pd

df = pd.DataFrame({'year': [2018, 2018, 2018, 2018, 2018, 2018, 2019, 2019, 2019, 2019, 2019, 2019],
                   'label': ['A', 'B', 'C', 'A', 'B', 'D', 'A', 'E', 'F', 'G', 'E', 'E'],
                   'cat': [236, 546, 671, 555, 871, 229, 811, 992, 227, 341, 701, 508],
                   'value': [2, 5, 6, 1, 9, 4, 7, 8, 13, 11, 3, 12]})

I would like to group this by year and label and return, for each group, the row with the lowest value in the value column. The row has to include the original index value as well as all the columns of df.

So, the result should be like:

 3  2018  A  555   1
 1  2018  B  546   5
 2  2018  C  671   6
 5  2018  D  229   4
 6  2019  A  811   7
10  2019  E  701   3
 8  2019  F  227  13
 9  2019  G  341  11

My code so far:

df.groupby(by=['year', 'label']).min().reset_index()

This has the correct values for value, but not for cat. Also, the original index is missing.

Any suggestions on how to solve this?

You can do this with idxmin():

df.loc[df.groupby(['year','label']).value.idxmin()]

Output:

    year label  cat  value
3   2018     A  555      1
1   2018     B  546      5
2   2018     C  671      6
5   2018     D  229      4
6   2019     A  811      7
10  2019     E  701      3
8   2019     F  227     13
9   2019     G  341     11
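
For anyone who wants to run this end to end, here is a minimal self-contained sketch of the same approach. The names idx and result are only illustrative. One thing worth knowing: idxmin returns the first index label when a group has tied minima, so each group yields exactly one row.

```python
import pandas as pd

df = pd.DataFrame({'year': [2018, 2018, 2018, 2018, 2018, 2018, 2019, 2019, 2019, 2019, 2019, 2019],
                   'label': ['A', 'B', 'C', 'A', 'B', 'D', 'A', 'E', 'F', 'G', 'E', 'E'],
                   'cat': [236, 546, 671, 555, 871, 229, 811, 992, 227, 341, 701, 508],
                   'value': [2, 5, 6, 1, 9, 4, 7, 8, 13, 11, 3, 12]})

# Index label of the smallest 'value' within each (year, label) group
idx = df.groupby(['year', 'label'])['value'].idxmin()

# Select those rows; the original index and all columns are kept
result = df.loc[idx]
print(result)
```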

This is a good indication that you want to sort and then call drop_duplicates, since a groupby aggregation returns a result indexed by the group keys, so the original row index is lost.

# This is a little simpler but the order will change.
# df.sort_values('value').drop_duplicates(['year', 'label'])

df.sort_values(['year', 'label', 'value']).drop_duplicates(['year', 'label'])

    year label  cat  value
3   2018     A  555      1
1   2018     B  546      5
2   2018     C  671      6
5   2018     D  229      4
6   2019     A  811      7
10  2019     E  701      3
8   2019     F  227     13
9   2019     G  341     11
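
As a quick sanity check, the sketch below compares the sort-based result with the idxmin-based one on the question's data, and shows that sort_index() can put the rows back in their original order if that matters. The variable names are illustrative, not part of either answer.

```python
import pandas as pd

df = pd.DataFrame({'year': [2018, 2018, 2018, 2018, 2018, 2018, 2019, 2019, 2019, 2019, 2019, 2019],
                   'label': ['A', 'B', 'C', 'A', 'B', 'D', 'A', 'E', 'F', 'G', 'E', 'E'],
                   'cat': [236, 546, 671, 555, 871, 229, 811, 992, 227, 341, 701, 508],
                   'value': [2, 5, 6, 1, 9, 4, 7, 8, 13, 11, 3, 12]})

# Sort so the smallest 'value' comes first in each group, then keep the first row per group
by_sort = df.sort_values(['year', 'label', 'value']).drop_duplicates(['year', 'label'])

# Same selection via idxmin, for comparison
by_idxmin = df.loc[df.groupby(['year', 'label'])['value'].idxmin()]

# Both approaches pick the same rows, in the same order, for this data
print(list(by_sort.index) == list(by_idxmin.index))

# Optional: restore the original row order
original_order = by_sort.sort_index()
print(original_order)
```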

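1. First, find the minimum value for each group:
res = df.groupby(['year', 'label'], as_index=False)['value'].min()

2. Then merge with the original df on the group keys and value (reset_index keeps the original index as a column, and joining on value keeps only the matching rows):
final_result = pd.merge(df.reset_index(), res, on=['year', 'label', 'value'], how='inner').set_index('index')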
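
One difference between a merge-based approach and idxmin shows up when a group has tied minima. The sketch below uses a small hypothetical frame (not the question's data) in which group (2018, A) has two rows with the same lowest value; it assumes the merge joins on the group keys plus value. Under that assumption, the merge keeps every tied row, while idxmin keeps only the first.

```python
import pandas as pd

# Hypothetical data: rows 0 and 1 tie for the minimum in group (2018, 'A')
tied = pd.DataFrame({'year': [2018, 2018, 2018],
                     'label': ['A', 'A', 'B'],
                     'cat': [100, 200, 300],
                     'value': [1, 1, 5]})

# Per-group minimum of 'value'
mins = tied.groupby(['year', 'label'], as_index=False)['value'].min()

# Merge on keys plus value; reset_index/set_index carries the original index through
via_merge = (tied.reset_index()
                 .merge(mins, on=['year', 'label', 'value'], how='inner')
                 .set_index('index'))

# idxmin returns one label per group (the first occurrence on ties)
via_idxmin = tied.loc[tied.groupby(['year', 'label'])['value'].idxmin()]

print(len(via_merge), len(via_idxmin))
```

If exactly one row per group is required, idxmin or drop_duplicates is the safer choice; if all tied rows are wanted, the merge does that.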
