I want to subset a dataframe based on a column with cumulative values (the column "value").
My dummy dataframe is:
index x y g1 g2 value
0 24.25 50.65 a 1 25
1 24.25 50.65 a 1 28
2 24.25 50.65 a 1 29
3 24.25 50.65 a 1 29
4 24.25 50.65 a 1 29
5 24.25 50.65 b 1 3
6 24.25 50.65 b 1 4
7 24.25 50.65 b 1 5
8 24.25 50.65 b 1 5
Expected output:
index x y g1 g2 value
0 24.25 50.65 a 1 25
1 24.25 50.65 a 1 28
2 24.25 50.65 a 1 29
3 24.25 50.65 b 1 3
4 24.25 50.65 b 1 4
5 24.25 50.65 b 1 5
I have already tried:
n=1
df_sub = df.groupby(['x', 'y', 'g1', 'g2']).apply(
    lambda x: x.nlargest(n, 'value', keep='first')).reset_index(drop=True)
But this does not keep the rows whose values are lower than the maximum. As far as I know, increasing n returns the n highest values, but I do not know in advance how many rows lie between the first row and the highest value of value in each group. Any help is highly appreciated. Omid.
A slightly different approach: keep the rows where value is not the maximum of its group (groupby transform) or is not a duplicate (duplicated):
max_m = (
df.groupby(['x', 'y', 'g1', 'g2'])['value']
.transform('max')
.ne(df['value'])
)
dup_m = ~df['value'].duplicated()
filtered_df = df[max_m | dup_m]
filtered_df:
x y g1 g2 value
0 24.25 50.65 a 1 25
1 24.25 50.65 a 1 28
2 24.25 50.65 a 1 29
5 24.25 50.65 b 1 3
6 24.25 50.65 b 1 4
7 24.25 50.65 b 1 5
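For anyone who wants to run this end to end, here is a minimal self-contained sketch. It rebuilds the dummy frame from the question by hand (the construction and the names keys, not_max and first_seen are illustrative, not from the original posts) and applies the same two masks:

```python
import pandas as pd

# Rebuild the dummy frame from the question.
df = pd.DataFrame({
    'x': [24.25] * 9,
    'y': [50.65] * 9,
    'g1': list('aaaaabbbb'),
    'g2': [1] * 9,
    'value': [25, 28, 29, 29, 29, 3, 4, 5, 5],
})

keys = ['x', 'y', 'g1', 'g2']

# True where the row's value is NOT the maximum of its group.
not_max = df.groupby(keys)['value'].transform('max').ne(df['value'])
# True for the first occurrence of each value in the whole column.
first_seen = ~df['value'].duplicated()

# Keep every non-maximum row, plus the first occurrence of each value.
filtered_df = df[not_max | first_seen]
print(filtered_df)
```

The original index labels (0, 1, 2, 5, 6, 7) are preserved; chaining .reset_index(drop=True) would renumber them as in the expected output.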
The benefit of this approach is that it removes only the duplicated maximums, not other duplicates, and the order of the rows does not matter:
df:
x y g1 g2 value
0 24.25 50.65 a 1 25
1 24.25 50.65 a 1 29 # Max
2 24.25 50.65 a 1 25 # Duplicated but not Max
3 24.25 50.65 a 1 28
4 24.25 50.65 a 1 29 # Max (2)
5 24.25 50.65 b 1 3
6 24.25 50.65 b 1 4
7 24.25 50.65 b 1 5
8 24.25 50.65 b 1 5
filtered_df:
x y g1 g2 value
0 24.25 50.65 a 1 25
1 24.25 50.65 a 1 29 # First Max is kept
2 24.25 50.65 a 1 25 # Duplicated but not Max (kept)
3 24.25 50.65 a 1 28
5 24.25 50.65 b 1 3
6 24.25 50.65 b 1 4
7 24.25 50.65 b 1 5
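One caveat that may be worth checking: duplicated is applied to the whole value column, so if one group's maximum equals a value already seen in an earlier group, that group would lose its maximum entirely. Below is a rough sketch of a per-group variant. The small frame is hypothetical (it is not the asker's data), and the idea of passing the group keys to DataFrame.duplicated via subset is my assumption about how to adapt the approach:

```python
import pandas as pd

# Hypothetical data: group 'b' peaks at 29, a value already present in group 'a'.
df = pd.DataFrame({
    'g1': list('aaabbb'),
    'value': [25, 29, 29, 3, 29, 29],
})

keys = ['g1']

# True where the row's value is NOT the maximum of its group.
not_max = df.groupby(keys)['value'].transform('max').ne(df['value'])

# Column-wide check: 29 was first seen in group 'a',
# so both 29s in group 'b' count as duplicates.
global_first = ~df['value'].duplicated()
# Per-group check: a row is a duplicate only if group keys AND value repeat.
group_first = ~df.duplicated(subset=keys + ['value'])

global_kept = df[not_max | global_first].index.tolist()
group_kept = df[not_max | group_first].index.tolist()
print(global_kept)  # [0, 1, 3]
print(group_kept)   # [0, 1, 3, 4]
```

With the per-group check, row 4 (the first 29 in group 'b') survives, while the repeated maximums in rows 2 and 5 are still dropped.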
Are you perhaps looking for df.drop_duplicates()? With subset you can specify which columns to operate on, and with keep which rows to keep.
>>> df.drop_duplicates(subset=['value'], keep='first')
index x y g1 g2 value
0 0 24.25 50.65 a 1 25
1 1 24.25 50.65 a 1 28
2 2 24.25 50.65 a 1 29
5 5 24.25 50.65 b 1 3
6 6 24.25 50.65 b 1 4
7 7 24.25 50.65 b 1 5
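Note that subset=['value'] compares values across the whole frame, regardless of group. A possible refinement, sketched here on a hand-built copy of the question's frame (the construction and the name deduped are illustrative), is to include the grouping columns in subset so that only repeats within the same group are dropped:

```python
import pandas as pd

# Rebuild the dummy frame from the question.
df = pd.DataFrame({
    'x': [24.25] * 9,
    'y': [50.65] * 9,
    'g1': list('aaaaabbbb'),
    'g2': [1] * 9,
    'value': [25, 28, 29, 29, 29, 3, 4, 5, 5],
})

# A row is dropped only if the group keys AND the value were seen before.
deduped = df.drop_duplicates(subset=['x', 'y', 'g1', 'g2', 'value'], keep='first')
print(deduped)
```

Unlike the mask-based answer, this drops every repeated value within a group, not just repeated maximums, so a plateau in the middle of a cumulative series would also be collapsed to one row.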