How to keep only the top n% rows of each group of a pandas dataframe?

I have seen a variant of this question that asks how to keep the top n rows of each group in a pandas dataframe (Pandas get topmost n records within each group), but the solutions there use n as an absolute number rather than a percentage. In my dataframe, each group has a different number of rows, and I want to keep the top n% of rows in each group. How would I approach this problem?

You can construct a Boolean series of flags from a groupby and use it to filter the dataframe. First, let's create an example dataframe and look at the number of rows for each unique value in the first column:

import numpy as np
import pandas as pd

np.random.seed(0)
df = pd.DataFrame(np.random.randint(0, 2, (10, 3)))

print(df[0].value_counts())

0    6
1    4
Name: 0, dtype: int64

Then define a fraction (50% in the example below) and construct a Boolean series for filtering:

n = 0.5

g = df.groupby(0)
flags = (g.cumcount() + 1) <= g[1].transform('size') * n

Then apply the condition, set the first column as the index and (if required) sort the index:

df = df.loc[flags].set_index(0).sort_index()

print(df)

   1  2
0      
0  1  1
0  1  1
0  1  0
1  1  1
1  1  0

As you can see, the resulting dataframe has only three rows with index 0 and two rows with index 1, in each case half the number in the original dataframe.
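Note that the flags above keep the first rows of each group in their existing order. If "top" should mean the largest values of some column, one possible adaptation is to sort before building the flags. The following is only a sketch: the function name `keep_top_fraction`, the column names `group` and `score`, and the sample data are made up for illustration and are not part of the answer above.

```python
import pandas as pd

def keep_top_fraction(df, group_col, value_col, frac):
    """Keep the rows with the largest `value_col` values, up to `frac` of each group."""
    # Put the largest values first; a stable sort keeps ties in their original order
    ordered = df.sort_values(value_col, ascending=False, kind="stable")
    g = ordered.groupby(group_col)
    # 1-based position of each row within its group, compared with the group's quota
    flags = (g.cumcount() + 1) <= g[value_col].transform("size") * frac
    # Restore the original row order for the rows that survive
    return ordered.loc[flags].sort_index()

# Hypothetical data: group "a" has 6 rows, group "b" has 4
example = pd.DataFrame({
    "group": ["a"] * 6 + ["b"] * 4,
    "score": [5, 1, 9, 3, 7, 2, 4, 8, 6, 0],
})

top_half = keep_top_fraction(example, "group", "score", 0.5)
print(top_half)
```

With a fraction of 0.5 this keeps three rows from group "a" and two from group "b". As in the answer above, a quota that is not a whole number is effectively rounded down by the `<=` comparison.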

Here is another option, which builds on some of the answers in the post you mentioned.

First of all, here is a quick function to round either up or down. If we want the top 30% of rows of a dataframe that is 8 rows long, we would try to take 2.4 rows, so we need to round either up or down.

My preferred option is to round up. This is because, for example, if we were to take 50% of the rows but had a group with only one row, we would still keep that one row. I kept this separate so that you can change the rounding as you wish.

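import math

def round_func(x, up=True):
    '''Round a float up (default) or down to the nearest integer'''
    if up:
        return math.ceil(x)   # int(x+1) would overshoot when x is already whole, e.g. 3.0 -> 4
    else:
        return math.floor(x)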
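To make the effect of the rounding direction concrete, here is a small standalone sketch that compares rounding down and rounding up for a few group sizes. It uses `math.floor` and `math.ceil` from the standard library; the helper name `rows_to_keep` is hypothetical and chosen only for this illustration.

```python
import math

def rows_to_keep(group_size, frac, up=True):
    """Number of rows to keep from a group of `group_size` rows (illustrative helper)."""
    exact = group_size * frac
    return math.ceil(exact) if up else math.floor(exact)

p = 0.30
for size in (1, 3, 4, 8):
    # columns: group size, rows kept when rounding down, rows kept when rounding up
    print(size, rows_to_keep(size, p, up=False), rows_to_keep(size, p, up=True))
```

Rounding down drops small groups entirely (a group with 1 or 3 rows keeps nothing at 30%), while rounding up always keeps at least one row per group, which is the behaviour preferred above.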

Next, I make a dataframe to work with and set a parameter p to the fraction of rows that we should keep from each group. The rest follows from there, and I have commented it so that hopefully you can follow along.

import pandas as pd
df = pd.DataFrame({'id':[1,1,1,2,2,2,2,3,4],'value':[1,2,3,1,2,3,4,1,1]})

p = 0.30 # top fraction to keep. Currently set to 30%
df_top = df.groupby('id').apply(                        # group by the ids
    lambda x: x.reset_index()['value'].nlargest(        # in each group take the top rows by column 'value'
        round_func(x.count().max()*p)))        # calculate how many to keep from each group

df_top = df_top.reset_index().drop('level_1', axis=1)   # make the dataframe nice again

df looks like this:

   id  value
0   1      1
1   1      2
2   1      3
3   2      1
4   2      2
5   2      3
6   2      4
7   3      1
8   4      1

df_top looks like this:

   id  value
0   1      3
1   2      4
2   2      3
3   3      1
4   4      1
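For comparison, the same result can probably be reached without `apply`, by sorting within each group and comparing each row's position against a rounded-up quota. This is a sketch of an alternative, not the method used above; the names `ordered`, `quota` and `df_top_alt` are introduced here only for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'id': [1, 1, 1, 2, 2, 2, 2, 3, 4],
                   'value': [1, 2, 3, 1, 2, 3, 4, 1, 1]})
p = 0.30  # top fraction to keep

# Largest values first within each id
ordered = df.sort_values(['id', 'value'], ascending=[True, False])
g = ordered.groupby('id')

# Rows to keep per group, rounded up so that every group keeps at least one row
quota = np.ceil(g['value'].transform('size') * p)

# cumcount() is the 0-based position of each row inside its group
df_top_alt = ordered[g.cumcount() < quota].reset_index(drop=True)
print(df_top_alt)
```

On this example data the rows kept match the df_top shown above: one row for id 1, two for id 2, and one each for ids 3 and 4.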
