
How to subset a DataFrame from the first row to the highest value in a column?

I want to subset a dataframe based on a column with cumulative values (the column "value").
My dummy dataframe is:

index      x      y  g1  g2  value
0      24.25  50.65   a   1     25
1      24.25  50.65   a   1     28
2      24.25  50.65   a   1     29
3      24.25  50.65   a   1     29
4      24.25  50.65   a   1     29
5      24.25  50.65   b   1      3
6      24.25  50.65   b   1      4
7      24.25  50.65   b   1      5
8      24.25  50.65   b   1      5

Expected output:

index      x      y  g1  g2  value
0      24.25  50.65   a   1     25
1      24.25  50.65   a   1     28
2      24.25  50.65   a   1     29
3      24.25  50.65   b   1      3
4      24.25  50.65   b   1      4
5      24.25  50.65   b   1      5

I have already tried:

n = 1
# Take the n largest 'value' rows from each group.
df_sub = (
    df.groupby(['x', 'y', 'g1', 'g2'])
      .apply(lambda g: g.nlargest(n, 'value', keep='first'))
      .reset_index(drop=True)
)

But this returns only the maximum row of each group and drops the rows with lower values. As far as I know, increasing n returns the n highest values, but I do not know in advance how many rows lie between the first row and the highest value of "value", so I cannot pick n. Any help is highly appreciated. Omid.

A slightly different approach: keep a row if its value is not the maximum of its group (groupby transform) or if its value has not appeared earlier in the frame (duplicated):

max_m = (
    df.groupby(['x', 'y', 'g1', 'g2'])['value']
        .transform('max')
        .ne(df['value'])
)
dup_m = ~df['value'].duplicated()
filtered_df = df[max_m | dup_m]

filtered_df :

       x      y g1  g2  value
0  24.25  50.65  a   1     25
1  24.25  50.65  a   1     28
2  24.25  50.65  a   1     29
5  24.25  50.65  b   1      3
6  24.25  50.65  b   1      4
7  24.25  50.65  b   1      5
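
For reference, here is a self-contained sketch of the same filter that can be run as-is. The frame is rebuilt from the question's dummy data, and the names keys, not_max and first_seen are illustrative, not from the original answer.

```python
import pandas as pd

# Dummy data reconstructed from the question.
df = pd.DataFrame({
    'x': [24.25] * 9,
    'y': [50.65] * 9,
    'g1': ['a'] * 5 + ['b'] * 4,
    'g2': [1] * 9,
    'value': [25, 28, 29, 29, 29, 3, 4, 5, 5],
})

keys = ['x', 'y', 'g1', 'g2']

# True where the row's value is NOT the maximum of its group.
not_max = df.groupby(keys)['value'].transform('max').ne(df['value'])
# True for the first occurrence of each value (checked across the whole frame).
first_seen = ~df['value'].duplicated()

# Keep every non-maximum row, plus the first occurrence of each value.
filtered_df = df[not_max | first_seen]
print(filtered_df)
```

The original row labels are preserved; calling filtered_df.reset_index(drop=True) afterwards renumbers them 0 to 5 as in the expected output.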

The benefit of this approach is that it removes only repeated maximums, not other duplicates, and the row order of the frame does not matter:

df :

       x      y g1  g2  value
0  24.25  50.65  a   1     25
1  24.25  50.65  a   1     29  # Max
2  24.25  50.65  a   1     25  # Duplicated but not Max
3  24.25  50.65  a   1     28
4  24.25  50.65  a   1     29  # Max (2)
5  24.25  50.65  b   1      3
6  24.25  50.65  b   1      4
7  24.25  50.65  b   1      5
8  24.25  50.65  b   1      5

filtered_df :

       x      y g1  g2  value
0  24.25  50.65  a   1     25
1  24.25  50.65  a   1     29  # First Max is kept
2  24.25  50.65  a   1     25  # Duplicated but not Max (kept)
3  24.25  50.65  a   1     28
5  24.25  50.65  b   1      3
6  24.25  50.65  b   1      4
7  24.25  50.65  b   1      5
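
One caveat worth checking against your own data: df['value'].duplicated() looks at the whole frame, not at each group. If one group's maximum equals a value that already appeared in an earlier group, every row holding that maximum is dropped. The sketch below uses made-up values to show the effect, and a possible per-group variant that includes the group columns in the duplicate check (the names frame_wide and per_group are illustrative).

```python
import pandas as pd

# Made-up data: group b's maximum (25) also appears in group a.
df = pd.DataFrame({
    'x': [24.25] * 7,
    'y': [50.65] * 7,
    'g1': ['a'] * 4 + ['b'] * 3,
    'g2': [1] * 7,
    'value': [25, 28, 29, 29, 20, 25, 25],
})
keys = ['x', 'y', 'g1', 'g2']

# True where the row's value is NOT the maximum of its group.
not_max = df.groupby(keys)['value'].transform('max').ne(df['value'])

# Frame-wide duplicate check: group b loses both of its 25s.
frame_wide = df[not_max | ~df['value'].duplicated()]

# Per-group duplicate check: the first 25 of group b is kept.
per_group = df[not_max | ~df.duplicated(subset=keys + ['value'])]

print(frame_wide.index.tolist())
print(per_group.index.tolist())
```

When values never repeat across groups, as in the question's data, both variants give the same result.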

Are you perhaps looking for df.drop_duplicates()?

With subset you specify which columns are compared when looking for duplicates, and with keep which of the duplicated rows is retained.

>>> df.drop_duplicates(subset=['value'], keep='first')
   index      x      y g1  g2  value
0      0  24.25  50.65  a   1     25
1      1  24.25  50.65  a   1     28
2      2  24.25  50.65  a   1     29
5      5  24.25  50.65  b   1      3
6      6  24.25  50.65  b   1      4
7      7  24.25  50.65  b   1      5
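
Note that subset=['value'] compares values across the whole frame and removes every repeated value, not only rows after the maximum. That is equivalent here because "value" is cumulative. If the goal is literally "every row up to and including the first maximum of each group", a position-based mask is one alternative. The following is a minimal sketch, assuming the row order within each group is meaningful; the names is_max and max_seen_before are illustrative.

```python
import pandas as pd

# Same columns as the question, but with values that are not sorted in group a.
df = pd.DataFrame({
    'x': [24.25] * 9,
    'y': [50.65] * 9,
    'g1': ['a'] * 5 + ['b'] * 4,
    'g2': [1] * 9,
    'value': [25, 29, 25, 28, 29, 3, 4, 5, 5],
})
keys = ['x', 'y', 'g1', 'g2']

# 1 where the row holds its group's maximum, else 0.
is_max = df['value'].eq(df.groupby(keys)['value'].transform('max')).astype(int)

# Running count of maximum rows per group, minus the current row's own flag,
# gives the number of maximum rows strictly before each row.
max_seen_before = df.assign(is_max=is_max).groupby(keys)['is_max'].cumsum() - is_max

# Keep rows that come before, or are, the first maximum of their group.
up_to_first_max = df[max_seen_before.eq(0)]
print(up_to_first_max)
```

On this unsorted data, drop_duplicates(subset=['value']) would keep rows 0, 1, 3, 5, 6 and 7, whereas the mask keeps rows 0, 1, 5, 6 and 7, so the two approaches only coincide when the column is non-decreasing within each group.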


 