
How to subset a dataframe from the first row to the highest value in a column?

I want to subset a dataframe based on a column with cumulative values (the column "value").
My dummy dataframe is:

index      x      y  g1  g2  value
0      24.25  50.65   a   1     25
1      24.25  50.65   a   1     28
2      24.25  50.65   a   1     29
3      24.25  50.65   a   1     29
4      24.25  50.65   a   1     29
5      24.25  50.65   b   1      3
6      24.25  50.65   b   1      4
7      24.25  50.65   b   1      5
8      24.25  50.65   b   1      5

Expected output:

index      x      y  g1  g2  value
0      24.25  50.65   a   1     25
1      24.25  50.65   a   1     28
2      24.25  50.65   a   1     29
3      24.25  50.65   b   1      3
4      24.25  50.65   b   1      4
5      24.25  50.65   b   1      5

I have already tried:

n = 1
# Keep the n largest 'value' rows of each group (only the maximum when n = 1)
df_sub = df.groupby(['x', 'y', 'g1', 'g2']).apply(
    lambda g: g.nlargest(n, 'value', keep='first')
).reset_index(drop=True)

But it does not keep the rows with values lower than the maximum. As far as I know, raising n returns the n highest values, but I do not know in advance how many rows lie between the first row and the highest value of value. Any help is highly appreciated. Omid.

A slightly different approach: keep the rows where value is not the maximum of its group (groupby transform) or is not a duplicate (duplicated):

max_m = (
    df.groupby(['x', 'y', 'g1', 'g2'])['value']
        .transform('max')
        .ne(df['value'])
)
dup_m = ~df['value'].duplicated()
filtered_df = df[max_m | dup_m]

filtered_df:

       x      y g1  g2  value
0  24.25  50.65  a   1     25
1  24.25  50.65  a   1     28
2  24.25  50.65  a   1     29
5  24.25  50.65  b   1      3
6  24.25  50.65  b   1      4
7  24.25  50.65  b   1      5

The benefit of this approach is that it removes only duplicated maximums, not other duplicates, and the order of the frame does not matter:

df:

       x      y g1  g2  value
0  24.25  50.65  a   1     25
1  24.25  50.65  a   1     29  # Max
2  24.25  50.65  a   1     25  # Duplicated but not Max
3  24.25  50.65  a   1     28
4  24.25  50.65  a   1     29  # Max (2)
5  24.25  50.65  b   1      3
6  24.25  50.65  b   1      4
7  24.25  50.65  b   1      5
8  24.25  50.65  b   1      5

filtered_df:

       x      y g1  g2  value
0  24.25  50.65  a   1     25
1  24.25  50.65  a   1     29  # First Max is kept
2  24.25  50.65  a   1     25  # Duplicated but not Max (kept)
3  24.25  50.65  a   1     28
5  24.25  50.65  b   1      3
6  24.25  50.65  b   1      4
7  24.25  50.65  b   1      5
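
One caveat: `df['value'].duplicated()` looks at the whole column, so if two groups happened to share the same maximum, the second group's maximum would be treated as a duplicate and dropped. A minimal sketch of a per-group variant follows; the frame is a made-up example (not the asker's data) in which group b shares its maximum with group a, and the names `keys`, `not_max` and `first_in_group` are only illustrative.

```python
import pandas as pd

# Hypothetical frame: group "b" has the same maximum (29) as group "a".
df = pd.DataFrame({
    'x': 24.25,
    'y': 50.65,
    'g1': ['a', 'a', 'a', 'b', 'b', 'b'],
    'g2': 1,
    'value': [25, 29, 29, 28, 29, 29],
})

keys = ['x', 'y', 'g1', 'g2']

# True where the row is NOT the maximum of its group.
not_max = df.groupby(keys)['value'].transform('max').ne(df['value'])

# True for the first occurrence of each value within its own group
# (the group keys are part of the duplicate check).
first_in_group = ~df.duplicated(subset=keys + ['value'])

filtered_df = df[not_max | first_in_group]
print(filtered_df)
```

With the column-wide `duplicated()` mask, row 4 (the first 29 in group b) would be removed as well, leaving group b without its maximum.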

Are you perhaps looking for df.drop_duplicates()?

With subset you can specify which columns to operate on, and with keep which rows to keep.

>>> df.drop_duplicates(subset=['value'], keep='first')
   index      x      y g1  g2  value
0      0  24.25  50.65  a   1     25
1      1  24.25  50.65  a   1     28
2      2  24.25  50.65  a   1     29
5      5  24.25  50.65  b   1      3
6      6  24.25  50.65  b   1      4
7      7  24.25  50.65  b   1      5
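
Note that `drop_duplicates` removes every later repeat of a value, not only the rows after the maximum. If the goal is literally to keep each group from its first row through the first occurrence of its maximum, something along these lines may be closer. It is a sketch that assumes the rows are already in the intended order; `keys`, `is_max` and `max_count` are illustrative names, not part of the question.

```python
import pandas as pd

# The dummy frame from the question.
df = pd.DataFrame({
    'x': 24.25,
    'y': 50.65,
    'g1': list('aaaaabbbb'),
    'g2': 1,
    'value': [25, 28, 29, 29, 29, 3, 4, 5, 5],
})

keys = ['x', 'y', 'g1', 'g2']

# True where the row holds the maximum of its group.
is_max = df.groupby(keys)['value'].transform('max').eq(df['value'])

# Running count of maxima seen so far in each group, current row included.
max_count = (
    df.assign(is_max=is_max.astype(int))
      .groupby(keys)['is_max']
      .cumsum()
)

# Keep rows before the first maximum (count 0) and the first maximum itself.
keep = (max_count == 0) | (is_max & (max_count == 1))
df_sub = df[keep].reset_index(drop=True)
print(df_sub)
```

Unlike the duplicate-based filters, this drops everything after the first maximum of a group, including later rows that are below the maximum, so it only suits data where that cut-off is what you want.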


