How do I subset a dataframe from the first row to the maximum value in a column?

Question

I want to subset a dataframe based on a column with cumulative values (the column "value").
My dummy dataframe is:

index      x      y g1  g2  value
0      24.25  50.65  a   1     25
1      24.25  50.65  a   1     28
2      24.25  50.65  a   1     29
3      24.25  50.65  a   1     29
4      24.25  50.65  a   1     29
5      24.25  50.65  b   1      3
6      24.25  50.65  b   1      4
7      24.25  50.65  b   1      5
8      24.25  50.65  b   1      5

Expected output:

index      x      y g1  g2  value
0      24.25  50.65  a   1     25
1      24.25  50.65  a   1     28
2      24.25  50.65  a   1     29
3      24.25  50.65  b   1      3
4      24.25  50.65  b   1      4
5      24.25  50.65  b   1      5

I have already tried:

n = 1
# Keep the n largest 'value' rows in each group
df_sub = df.groupby(['x', 'y', 'g1', 'g2']).apply(
    lambda x: x.nlargest(n, 'value', keep='first')).reset_index(drop=True)

But it does not keep the rows with values lower than the maximum. As far as I know, raising n returns the n highest values, but I do not know in advance how many rows lie between the first row and the highest value of value. Any help is highly appreciated. Omid.

Answer 1

A slightly different approach: keep the rows where value is not the maximum of its group ( groupby transform ) or is not a duplicate ( duplicated ):

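# True where the row's value is NOT the maximum of its group
max_m = (
    df.groupby(['x', 'y', 'g1', 'g2'])['value']
        .transform('max')
        .ne(df['value'])
)
# True for the first occurrence of each value (checked over the whole column)
dup_m = ~df['value'].duplicated()
# Keep rows that are either non-maximum or a first occurrence
filtered_df = df[max_m | dup_m]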

filtered_df :

       x      y g1  g2  value
0  24.25  50.65  a   1     25
1  24.25  50.65  a   1     28
2  24.25  50.65  a   1     29
5  24.25  50.65  b   1      3
6  24.25  50.65  b   1      4
7  24.25  50.65  b   1      5

The benefit of this approach is that it removes only duplicated maximums, not other duplicates, and the order of the frame does not matter:

df :

       x      y g1  g2  value
0  24.25  50.65  a   1     25
1  24.25  50.65  a   1     29  # Max
2  24.25  50.65  a   1     25  # Duplicated but not Max
3  24.25  50.65  a   1     28
4  24.25  50.65  a   1     29  # Max (2)
5  24.25  50.65  b   1      3
6  24.25  50.65  b   1      4
7  24.25  50.65  b   1      5
8  24.25  50.65  b   1      5

filtered_df :

       x      y g1  g2  value
0  24.25  50.65  a   1     25
1  24.25  50.65  a   1     29  # First Max is kept
2  24.25  50.65  a   1     25  # Duplicated but not Max (kept)
3  24.25  50.65  a   1     28
5  24.25  50.65  b   1      3
6  24.25  50.65  b   1      4
7  24.25  50.65  b   1      5
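
If the goal is read literally as "from the first row of each group up to the first maximum", a position-based mask is another option. The following is only a minimal sketch, assuming the rows are already in the desired order within each group; the names keys, is_max, max_seen and result are illustrative and not part of the answer above.

```python
import pandas as pd

# Dummy frame from the question
df = pd.DataFrame({
    "x": [24.25] * 9,
    "y": [50.65] * 9,
    "g1": list("aaaaabbbb"),
    "g2": [1] * 9,
    "value": [25, 28, 29, 29, 29, 3, 4, 5, 5],
})

keys = ["x", "y", "g1", "g2"]

# True where the row holds the maximum of its group
is_max = df["value"].eq(df.groupby(keys)["value"].transform("max"))

# Running count of maximum rows seen so far in each group (current row included)
max_seen = (
    df.assign(_is_max=is_max.astype(int))
      .groupby(keys)["_is_max"]
      .cumsum()
)

# Keep rows before the first maximum, plus the first maximum itself
result = df[(max_seen == 0) | (is_max & (max_seen == 1))]
print(result)
```

Unlike the mask-based answer, this variant depends on row order: any row that comes after the first maximum of its group is dropped, even if its value is lower than the maximum.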

Answer 2

Are you perhaps looking for df.drop_duplicates() ?

With subset you can specify which columns to consider when identifying duplicates, and with keep which of the duplicated rows to retain.

>>> df.drop_duplicates(subset=['value'], keep='first')
   index      x      y g1  g2  value
0      0  24.25  50.65  a   1     25
1      1  24.25  50.65  a   1     28
2      2  24.25  50.65  a   1     29
5      5  24.25  50.65  b   1      3
6      6  24.25  50.65  b   1      4
7      7  24.25  50.65  b   1      5
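
One thing to watch for: with subset=['value'] the duplicate check runs over the whole column, so a value that repeats in a different group is dropped as well. The sketch below uses a small hypothetical frame (not the data from the question) to compare that call with one that also lists the grouping column in subset.

```python
import pandas as pd

# Hypothetical frame in which group "b" repeats values already seen in group "a"
df = pd.DataFrame({
    "g1": ["a", "a", "a", "a", "b", "b", "b"],
    "value": [1, 2, 3, 3, 2, 3, 3],
})

# Duplicates are detected across the whole 'value' column
by_value = df.drop_duplicates(subset=["value"], keep="first")

# Duplicates are detected per (g1, value) pair, i.e. within each group
by_group = df.drop_duplicates(subset=["g1", "value"], keep="first")

print(by_value)
print(by_group)
```

In this frame the first call drops every row of group "b", while the second keeps the first occurrence of each value inside each group. Either way, drop_duplicates removes all repeated values, not only repeated maximums.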

2 answers

Answer 1: 1, accepted, 2021-06-12 16:46:25

Answer 2: 0, 2021-06-12 16:32:24