
How to keep only the top n% rows of each group of a pandas dataframe?

I have seen a variant of this question that keeps the top n rows of each group in a pandas dataframe (Pandas get topmost n records within each group), but the solutions there use n as an absolute number rather than a percentage. In my dataframe, each group has a different number of rows, and I want to keep the top n% of rows of each group. How would I approach this problem?

You can use groupby to construct a Boolean series of flags and then filter the dataframe with it. First, let's create an example dataframe and look at the number of rows for each unique value in the first column:

import numpy as np
import pandas as pd

np.random.seed(0)
df = pd.DataFrame(np.random.randint(0, 2, (10, 3)))

print(df[0].value_counts())

0    6
1    4
Name: 0, dtype: int64

Then define a fraction, e.g. 50% below, and construct a Boolean series for filtering:

n = 0.5

g = df.groupby(0)
flags = (g.cumcount() + 1) <= g[1].transform('size') * n

Then apply the condition, set the first column as the index and (if required) sort the index:

df = df.loc[flags].set_index(0).sort_index()

print(df)

   1  2
0      
0  1  1
0  1  1
0  1  0
1  1  1
1  1  0

As you can see, the resultant dataframe has only three rows with index 0 and two rows with index 1, in each case half the number in the original dataframe.
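Note that the mask above keeps the first rows of each group in their existing order. If "top" should mean the largest values of some column, one option is to sort before computing the mask. Here is a minimal sketch of that idea, assuming a hypothetical dataframe with columns `group` and `score` (neither name comes from the example above):

```python
import pandas as pd

# Hypothetical data: 'group' and 'score' are illustrative names only.
df = pd.DataFrame({
    'group': ['a', 'a', 'a', 'a', 'b', 'b'],
    'score': [10, 40, 20, 30, 5, 15],
})

n = 0.5  # fraction of each group to keep

# Sort so that the largest scores come first.
ranked = df.sort_values('score', ascending=False)

g = ranked.groupby('group')
# cumcount() numbers rows 0, 1, 2, ... within each group in the sorted order;
# transform('size') broadcasts each group's row count to all of its rows.
flags = (g.cumcount() + 1) <= g['score'].transform('size') * n

# Keep the flagged rows and order the result by group, then by score.
top = ranked.loc[flags].sort_values(['group', 'score'], ascending=[True, False])
print(top)
```

With these values, group a (4 rows) keeps its two largest scores and group b (2 rows) keeps one.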

Here is another option, which builds on some of the answers in the post you mentioned.

First of all, here is a quick function to round either up or down. If we want the top 30% of the rows of a dataframe that is 8 rows long, we would try to take 2.4 rows, so we need to round one way or the other.

My preferred option is to round up. This is because, for example, if we were to take 50% of the rows but one group had only one row, we would still keep that one row. I kept the function separate so that you can change the rounding as you wish.

import math

def round_func(x, up=True):
    '''Round a float up (default) or down to an integer'''
    if up:
        return math.ceil(x)   # int(x+1) would turn an exact 2.0 into 3
    else:
        return math.floor(x)
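To make the rounding choice concrete, here is a small sketch that works through the two cases mentioned above (30% of 8 rows, and 50% of a single-row group). It uses `math.ceil` and `math.floor` from the standard library directly; the variable names are illustrative only.

```python
import math

group_size = 8   # rows in the group, as in the example above
p = 0.30         # fraction to keep

exact = group_size * p          # 2.4 rows, not a whole number
keep_up = math.ceil(exact)      # rounding up keeps 3 rows
keep_down = math.floor(exact)   # rounding down keeps 2 rows

# A single-row group at 50%: rounding up keeps the row, rounding down drops it.
single_up = math.ceil(1 * 0.5)
single_down = math.floor(1 * 0.5)

print(keep_up, keep_down, single_up, single_down)
```

Rounding down can therefore remove small groups entirely, which is the reason given above for preferring to round up.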

Next I make a dataframe to work with and set a parameter p, the fraction of rows to keep from each group. The rest of the code follows, with comments explaining each step.

import pandas as pd
df = pd.DataFrame({'id':[1,1,1,2,2,2,2,3,4],'value':[1,2,3,1,2,3,4,1,1]})

p = 0.30 # top fraction to keep. Currently set to 30%
df_top = df.groupby('id').apply(                        # group by the ids
    lambda x: x.reset_index()['value'].nlargest(        # in each group take the top rows by column 'value'
        round_func(x.count().max()*p)))        # calculate how many to keep from each group

df_top = df_top.reset_index().drop('level_1', axis=1)   # make the dataframe nice again

df looks like this:

   id  value
0   1      1
1   1      2
2   1      3
3   2      1
4   2      2
5   2      3
6   2      4
7   3      1
8   4      1

df_top looks like this:

   id  value
0   1      3
1   2      4
2   2      3
3   3      1
4   4      1
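As a possible alternative to `apply`, the same selection can be sketched with a per-group rank and a per-group size. This is not the method used above, only one way it might be written; it assumes rounding up via `np.ceil` and breaks ties by order of appearance (`method='first'`).

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'id': [1, 1, 1, 2, 2, 2, 2, 3, 4],
                   'value': [1, 2, 3, 1, 2, 3, 4, 1, 1]})
p = 0.30  # top fraction to keep

g = df.groupby('id')['value']
# Position of each row within its group: 1 = largest value,
# ties broken by order of appearance.
rank = g.rank(method='first', ascending=False)
# Number of rows to keep per group, rounded up so that
# every group keeps at least one row.
keep = np.ceil(g.transform('size') * p)

df_top = (df[rank <= keep]
          .sort_values(['id', 'value'], ascending=[True, False])
          .reset_index(drop=True))
print(df_top)
```

For this data the result has the same rows as the df_top shown above, and the original columns are kept without needing to drop `level_1` afterwards.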
