简体   繁体   中英

Get top n values per category in pandas retaining all columns

After some transformations I got the following dataframe , how do I proceed to obtain the top n records by a column in this case short_name and using other as indicator frequency . I read this post but the problem with both solutions is that they get rid of the column product_name , they just retain the grouped column and I need to keep them all.

short_name          product_id    frequency
Yoghurt y cereales  975009684     32
Yoghurt y cereales  975009685     21
Yoghurt y cereales  975009700     16
Yoghurt y Cereales  21097         16
Yoghurt Bebible     21329         68
Yoghurt Bebible     21328         67
Yoghurt Bebible     21500         31

You could first sort the dataframe, then use groupby :

df.sort_values('frequency', ascending=False).groupby('short_name').head(2)
Out[28]: 
           short_name  product_id  frequency
4     Yoghurt Bebible       21329         68
5     Yoghurt Bebible       21328         67
0  Yoghurt y cereales   975009684         32
1  Yoghurt y cereales   975009685         21
3  Yoghurt y Cereales       21097         16

I'd try to use nlargest method:

In [5]: df.groupby('short_name', as_index=False).apply(lambda x: x.nlargest(2, 'frequency'))
Out[5]:
             short_name  product_id  frequency
0 4     Yoghurt Bebible       21329         68
  5     Yoghurt Bebible       21328         67
1 3  Yoghurt y Cereales       21097         16
2 0  Yoghurt y cereales   975009684         32
  1  Yoghurt y cereales   975009685         21

You can try this:

df.groupby('short_name', as_index=False).apply(lambda x: x.sort_values(by='frequency',ascending=False).head(2)).reset_index(drop=True)

Output:

           short_name  product_id  frequency
0     Yoghurt Bebible       21329         68
1     Yoghurt Bebible       21328         67
2  Yoghurt y Cereales       21097         16
3  Yoghurt y cereales   975009684         32
4  Yoghurt y cereales   975009685         21

If I understand your question correctly using .apply is not necessary in this case. This is something I would avoid unless there's no other option because of performance issues.

Try this:

df.sort_values(by='frequency',ascending=False).groupby('short_name').head(2).reset_index()
df.groupby('short_name').nlargest(2, 'frequency').reset_index()

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM