How to keep only the top n% of rows in each group of a pandas dataframe?
I have seen a variant of this question that keeps the top n rows of each group in a pandas dataframe (see "Pandas get topmost n records within each group"), but the solutions there use n as an absolute number rather than a percentage. However, in my dataframe each group has a different number of rows, and I want to keep the top n% of rows in each group. How would I approach this problem?
You can use groupby to construct a Boolean series of flags and then filter the dataframe with it. First, let's create an example dataframe and look at the number of rows for each unique value in the first series:
import numpy as np
import pandas as pd

np.random.seed(0)
df = pd.DataFrame(np.random.randint(0, 2, (10, 3)))
print(df[0].value_counts())
0    6
1    4
Name: 0, dtype: int64
Then define a fraction, e.g. 50% below, and construct a Boolean series for filtering:
n = 0.5
g = df.groupby(0)
flags = (g.cumcount() + 1) <= g[1].transform('size') * n
Then apply the condition, set the index as the first series and (if required) sort the index:
df = df.loc[flags].set_index(0).sort_index()
print(df)
   1  2
0
0  1  1
0  1  1
0  1  0
1  1  1
1  1  0
As you can see, the resultant dataframe has only three rows with index 0 and two rows with index 1, in each case half the number in the original dataframe.
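The flag approach above can be wrapped in a small reusable function. This is only a sketch: the name head_fraction, its parameters and the example frame are hypothetical, not part of the original answer. It keeps the first rows of each group in their existing order, so if "top" means "largest by some column" the frame would need to be sorted by that column first.

```python
import pandas as pd


def head_fraction(df, key, frac):
    """Keep the first `frac` share of rows in each group of `key` (hypothetical helper)."""
    g = df.groupby(key)
    # 1-based position of each row within its group
    position = g.cumcount() + 1
    # size of the group each row belongs to, aligned to df's index
    group_size = g[key].transform('size')
    return df.loc[position <= group_size * frac]


# Illustrative data: group 'a' has 6 rows, group 'b' has 4 rows
example = pd.DataFrame({'grp': list('aaaaaabbbb'), 'val': range(10)})
kept = head_fraction(example, 'grp', 0.5)
print(kept)
```

With a fraction of 0.5 this keeps three of the six 'a' rows and two of the four 'b' rows. Because the comparison is against group_size * frac without rounding, a fractional product is effectively rounded down.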
Here is another option, which builds on some of the answers in the post you mentioned.
First of all, here is a quick function to round either up or down. If we want the top 30% of rows of a dataframe 8 rows long, we would try to take 2.4 rows, so we need to round either up or down.
My preferred option is to round up. This is because, for example, if we were to take 50% of the rows but one group had only one row, we would still keep that one row. I kept this separate so that you can change the rounding as you wish.
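import math

def round_func(x, up=True):
    '''Round a float up or down to an int.'''
    if up:
        # math.ceil leaves whole numbers unchanged, whereas int(x + 1) would turn 2.0 into 3
        return math.ceil(x)
    else:
        return math.floor(x)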
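To make the effect of the rounding choice concrete, here is a small sketch that computes how many rows would be kept for a few group sizes. The helper name rows_to_keep is an assumption made for illustration. It uses math.ceil and math.floor, and the last line shows why a true ceiling matters: when the product is already a whole number, adding one and truncating would keep one row too many.

```python
import math


def rows_to_keep(group_size, frac, up=True):
    """Number of rows to keep from a group of `group_size` rows (hypothetical helper)."""
    exact = group_size * frac
    return math.ceil(exact) if up else math.floor(exact)


print(rows_to_keep(8, 0.30))            # 2.4 rounds up to 3
print(rows_to_keep(8, 0.30, up=False))  # 2.4 rounds down to 2
print(rows_to_keep(1, 0.50))            # 0.5 rounds up to 1, so a single-row group survives
print(rows_to_keep(1, 0.50, up=False))  # 0.5 rounds down to 0, so the group disappears
print(rows_to_keep(4, 0.50))            # 2.0 is already whole and stays 2
```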
Next I make a dataframe to work with and set a parameter p to be the fraction of the rows from each group that we should keep. The rest follows from there, and I have commented it so that hopefully you can follow along.
import pandas as pd

df = pd.DataFrame({'id': [1, 1, 1, 2, 2, 2, 2, 3, 4],
                   'value': [1, 2, 3, 1, 2, 3, 4, 1, 1]})
p = 0.30  # top fraction to keep; currently set to 30%

df_top = df.groupby('id').apply(                       # group by the ids
    lambda x: x.reset_index()['value'].nlargest(       # in each group take the top rows by column 'value'
        round_func(x.count().max() * p)))              # calculate how many to keep from each group
df_top = df_top.reset_index().drop('level_1', axis=1)  # make the dataframe nice again
df looked like this:
   id  value
0   1      1
1   1      2
2   1      3
3   2      1
4   2      2
5   2      3
6   2      4
7   3      1
8   4      1
df_top looks like this:
   id  value
0   1      3
1   2      4
2   2      3
3   3      1
4   4      1
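For comparison, the same selection can probably be done without apply by sorting first and then using cumcount against a rounded-up group size. This is a sketch under the same example data, not a method taken from either answer; the names ordered, keep and df_top_alt are introduced here for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'id': [1, 1, 1, 2, 2, 2, 2, 3, 4],
                   'value': [1, 2, 3, 1, 2, 3, 4, 1, 1]})
p = 0.30  # top fraction to keep from each group

# Sort so that the largest values come first within each id
ordered = df.sort_values(['id', 'value'], ascending=[True, False])
g = ordered.groupby('id')

# Rows to keep per group, rounded up so that every group keeps at least one row
keep = np.ceil(g['value'].transform('size') * p)

# cumcount is the 0-based position of each row within its group
df_top_alt = ordered[g.cumcount() < keep].reset_index(drop=True)
print(df_top_alt)
```

On this data it selects the same rows as df_top above: one row for id 1, two for id 2, and one each for ids 3 and 4.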