How to subset a dataframe from the first row to the highest value in a column?
I want to subset a dataframe based on a column with cumulative values (the column "value").
My dummy dataframe is:
index x y g1 g2 value
0 24.25 50.65 a 1 25
1 24.25 50.65 a 1 28
2 24.25 50.65 a 1 29
3 24.25 50.65 a 1 29
4 24.25 50.65 a 1 29
5 24.25 50.65 b 1 3
6 24.25 50.65 b 1 4
7 24.25 50.65 b 1 5
8 24.25 50.65 b 1 5
Expected output:
index x y g1 g2 value
0 24.25 50.65 a 1 25
1 24.25 50.65 a 1 28
2 24.25 50.65 a 1 29
3 24.25 50.65 b 1 3
4 24.25 50.65 b 1 4
5 24.25 50.65 b 1 5
I have already tried:
n = 1
df_sub = df.groupby(['x', 'y', 'g1', 'g2']).apply(
    lambda x: x.nlargest(n, 'value', keep='first')).reset_index(drop=True)
But it does not keep the rows with values lower than the maximum. As far as I know, if you change n to a higher value you get the n highest values, but the point is that I do not know in advance how many rows lie between the first row and the row holding the highest value.
Any help is highly appreciated.
Omid.
A slightly different approach: build a mask that keeps the rows where value is not the max of its group (groupby transform) or where value is not a duplicate (duplicated):
max_m = (
df.groupby(['x', 'y', 'g1', 'g2'])['value']
.transform('max')
.ne(df['value'])
)
dup_m = ~df['value'].duplicated()
filtered_df = df[max_m | dup_m]
filtered_df:
x y g1 g2 value
0 24.25 50.65 a 1 25
1 24.25 50.65 a 1 28
2 24.25 50.65 a 1 29
5 24.25 50.65 b 1 3
6 24.25 50.65 b 1 4
7 24.25 50.65 b 1 5
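One thing to watch: df['value'].duplicated() looks at the whole column, not at each group. If a group's maximum happens to equal a value that appeared earlier in another group, every row holding that maximum would be dropped. The sketch below is a per-group variant, not the answer's exact code; the helper names drop_repeated_group_max and make_frame are made up for illustration, and the second frame is invented data chosen to trigger the overlap.

```python
import pandas as pd

GROUP_COLS = ["x", "y", "g1", "g2"]


def drop_repeated_group_max(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: drop repeats of each group's maximum value."""
    # True where the row's value is NOT its group's maximum.
    not_max = df.groupby(GROUP_COLS)["value"].transform("max").ne(df["value"])
    # True for the first occurrence of a value within its own group.
    first_in_group = ~df.duplicated(subset=GROUP_COLS + ["value"])
    return df[not_max | first_in_group]


def make_frame(g1, value):
    """Build a frame shaped like the question's, with constant x, y and g2."""
    n = len(value)
    return pd.DataFrame(
        {"x": [24.25] * n, "y": [50.65] * n, "g1": g1, "g2": [1] * n, "value": value}
    )


# The dummy data from the question.
question_df = make_frame(list("aaaaabbbb"), [25, 28, 29, 29, 29, 3, 4, 5, 5])
question_result = drop_repeated_group_max(question_df)
print(question_result)

# Made-up data where group b's maximum (25) already appears in group a.
overlap_df = make_frame(list("aaaabbb"), [25, 28, 29, 29, 3, 25, 25])
overlap_result = drop_repeated_group_max(overlap_df)
print(overlap_result)
```

On the question's data this keeps rows 0, 1, 2, 5, 6 and 7, the same as the whole-column version. On the overlapping data it keeps the first 25 of group b, which the whole-column check would have removed.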
The benefit of this approach is that it only removes duplicated maximums, not other duplicates, and the order of the frame does not matter:
df:
x y g1 g2 value
0 24.25 50.65 a 1 25
1 24.25 50.65 a 1 29 # Max
2 24.25 50.65 a 1 25 # Duplicated but not Max
3 24.25 50.65 a 1 28
4 24.25 50.65 a 1 29 # Max (2)
5 24.25 50.65 b 1 3
6 24.25 50.65 b 1 4
7 24.25 50.65 b 1 5
8 24.25 50.65 b 1 5
filtered_df:
x y g1 g2 value
0 24.25 50.65 a 1 25
1 24.25 50.65 a 1 29 # First Max is kept
2 24.25 50.65 a 1 25 # Duplicated but not Max (kept)
3 24.25 50.65 a 1 28
5 24.25 50.65 b 1 3
6 24.25 50.65 b 1 4
7 24.25 50.65 b 1 5
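Note that on this unordered frame, rows 2 and 3 survive even though they come after the first maximum. If "from the first row to the highest value" is meant literally, so that everything after the first maximum in each group should go, a different mask is needed. The following is a minimal sketch of that alternative, assuming the same grouping columns; it is a different technique from the answer's, and the temporary column name _is_max is an arbitrary choice.

```python
import pandas as pd

group_cols = ["x", "y", "g1", "g2"]

# The unordered example frame shown above.
df = pd.DataFrame({
    "x": [24.25] * 9,
    "y": [50.65] * 9,
    "g1": list("aaaaabbbb"),
    "g2": [1] * 9,
    "value": [25, 29, 25, 28, 29, 3, 4, 5, 5],
})

# True where the row holds its group's maximum.
is_max = df["value"].eq(df.groupby(group_cols)["value"].transform("max"))

# Running count, per group, of maximum rows seen so far (current row included).
seen = (
    df.assign(_is_max=is_max.astype(int))
    .groupby(group_cols)["_is_max"]
    .cumsum()
)

# Keep rows before the first maximum (count still 0) and the first maximum itself.
keep = seen.eq(0) | (is_max & seen.eq(1))
up_to_first_max = df[keep]
print(up_to_first_max)
```

Here group a is cut after row 1 and group b after row 7. When value really is cumulative (non-decreasing within each group), this mask and the one above select the same rows.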
Are you perhaps looking for df.drop_duplicates()?
With subset you can specify which columns to operate on, and with keep which rows to keep.
>>> df.drop_duplicates(subset=['value'], keep='first')
index x y g1 g2 value
0 0 24.25 50.65 a 1 25
1 1 24.25 50.65 a 1 28
2 2 24.25 50.65 a 1 29
5 5 24.25 50.65 b 1 3
6 6 24.25 50.65 b 1 4
7 7 24.25 50.65 b 1 5
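With subset=['value'] the comparison runs over the whole column, so a value that recurs in a different group is dropped as well. A hedged sketch, assuming the grouping columns x, y, g1 and g2 from the question should count as part of the key; the data here is made up so that group b repeats a value from group a.

```python
import pandas as pd

# Invented data: group b's maximum (25) also appears in group a.
df = pd.DataFrame({
    "x": [24.25] * 7,
    "y": [50.65] * 7,
    "g1": list("aaaabbb"),
    "g2": [1] * 7,
    "value": [25, 28, 29, 29, 3, 25, 25],
})

# Compare on value alone: group b loses both of its 25 rows.
by_value = df.drop_duplicates(subset=["value"], keep="first")

# Compare on the group columns plus value: each group keeps its first 25.
by_group = df.drop_duplicates(subset=["x", "y", "g1", "g2", "value"], keep="first")

print(by_value)
print(by_group)
```

Either way, drop_duplicates removes every repeated value, not only repeated maximums, so it fits best when value is strictly increasing until it plateaus at the maximum.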