
Pandas: set all values that are <= 0 to the maximum value in a column by group, but only after the last positive value in that group

I am trying to set all values that are <= 0, by group, to the maximum value in that group, but only after the last positive value. That is, all values <=0 in the group that come before the last positive value must be ignored. Example:

import pandas as pd

data = {'group': ['A', 'A', 'A', 'A', 'A', 'B', 'B', 'B',
                  'B', 'B', 'B', 'C', 'C', 'C', 'C', 'C'],
        'value': [3, 0, 8, 7, 0, -1, 0, 9, -2, 0, 0, 2, 0, 5, 0, 1]}
df = pd.DataFrame(data)
df

  group  value
0   A      3
1   A      0
2   A      8
3   A      7
4   A      0
5   B     -1
6   B      0
7   B      9
8   B     -2
9   B      0
10  B      0
11  C      2
12  C      0
13  C      5
14  C      0
15  C      1

and the result must be:

  group  value
0   A      3
1   A      0
2   A      8
3   A      7
4   A      8
5   B     -1
6   B      0
7   B      9
8   B      9
9   B      9
10  B      9
11  C      2
12  C      0
13  C      5
14  C      0
15  C      1

Thanks for any advice.

Start by adding a column that identifies the rows with a non-positive value (<= 0):

df['neg'] = (df['value'] <= 0)

Then, for each group, find the contiguous run of entries at the end of the group that have 'neg' set to True. To do that, reverse the row order of the DataFrame (with .iloc[::-1]) and apply .cumprod() to the 'neg' column within each group. cumprod() treats True as 1 and False as 0, so the cumulative product stays 1 while every value seen so far is True, and drops to 0 (and stays there) at the first False. Because the rows are reversed, the scan runs backwards from the end of each group, so the rows still marked 1 are exactly the trailing run of True values. Assigning the result to a new column aligns it on the index, which restores the original row order.

df['upd'] = df.iloc[::-1].groupby('group')['neg'].cumprod().astype(bool)
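To make the reversed cumulative product concrete, here is a minimal sketch on the values of group B alone. The names `values`, `neg` and `trailing` are only illustrative, and the flag is cast to int here so the product is plainly numeric.

```python
import pandas as pd

# Values of group B from the question, in their original order.
values = pd.Series([-1, 0, 9, -2, 0, 0])

# 1 where the value is <= 0, 0 otherwise.
neg = (values <= 0).astype(int)

# Scan from the end: the running product stays 1 until the first positive
# value is met, then stays 0. Reversing again restores the original order.
trailing = neg.iloc[::-1].cumprod().iloc[::-1].astype(bool)

print(trailing.tolist())
# [False, False, False, True, True, True]
```

Only the last three rows are flagged, which are the rows after the last positive value (9).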

Now that we know which entries to update, we need the value to update them to, which is the maximum of the group. transform('max') on a groupby returns that maximum for every row. All that is left is to update 'value' where 'upd' is set (the assignment aligns on the index, so each selected row receives its own group's maximum):

df.loc[df['upd'], 'value'] = df.groupby('group')['value'].transform('max')

We can finish by dropping the two auxiliary columns we used in the process:

df = df.drop(['neg', 'upd'], axis=1)

The result I got matches your expected result.
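For reuse, the steps above can be wrapped in a small function that leaves the input DataFrame untouched. This is only a sketch: the function name `fill_trailing_nonpositive` and its parameters are hypothetical, not part of pandas.

```python
import pandas as pd


def fill_trailing_nonpositive(df, group_col="group", value_col="value"):
    """Hypothetical helper: set the trailing run of values <= 0 in each
    group to that group's maximum, returning a modified copy."""
    out = df.copy()
    rev = out.iloc[::-1]  # rows in reverse order, original index kept

    # 1 for values <= 0, 0 otherwise; the per-group running product over the
    # reversed rows stays 1 only for the trailing run of non-positive values.
    neg = (rev[value_col] <= 0).astype(int)
    trailing = neg.groupby(rev[group_col]).cumprod().astype(bool)
    trailing = trailing.reindex(out.index)  # back to the original row order

    group_max = out.groupby(group_col)[value_col].transform("max")
    out.loc[trailing, value_col] = group_max[trailing]
    return out


data = {"group": list("AAAAABBBBBBCCCCC"),
        "value": [3, 0, 8, 7, 0, -1, 0, 9, -2, 0, 0, 2, 0, 5, 0, 1]}
df = pd.DataFrame(data)
result = fill_trailing_nonpositive(df)
print(result["value"].tolist())
# [3, 0, 8, 7, 8, -1, 0, 9, 9, 9, 9, 2, 0, 5, 0, 1]
```

Note that a group with no positive value at all is entirely a "trailing run", so every row in it would be set to the group maximum.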


UPDATE: Or do the whole operation in a single (long!) line, without adding any auxiliary columns to the original DataFrame:

df.loc[
    df.assign(
        neg=(df['value'] <= 0)
    ).iloc[::-1].groupby(
        'group'
    )['neg'].cumprod().astype(bool),
    'value'
] = df.groupby(
    'group'
)['value'].transform('max')

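You can also do it this way, by blanking the values to be replaced and forward-filling them within each group. Note that this only touches groups that contain a negative value, and it fills from the group's minimum onward with the previous value, so on the sample data it fixes group B but leaves row 4 of group A at 0 rather than 8.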

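import numpy as np

# Rows in groups that contain a negative value, from the group's minimum onward
mask = (df['value'].lt(0).groupby(df['group'], sort=False).transform('any') &
        (df.index >= df.groupby('group')['value'].transform('idxmin')))
df.loc[mask, 'value'] = np.nan
# Forward-fill the blanked rows within each group
df['value'] = df.groupby('group')['value'].ffill()
df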

Output

group   value
0   A   3.0
1   A   0.0
2   A   8.0
3   A   7.0
4   A   0.0
5   B   -1.0
6   B   0.0
7   B   9.0
8   B   9.0
9   B   9.0
10  B   9.0
11  C   2.0
12  C   0.0
13  C   5.0
14  C   0.0
15  C   1.0
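The output above leaves row 4 of group A at 0, whereas the expected result in the question has 8 there. As a sketch of a variant that covers that case as well, the rows can be selected with a reversed per-group cummax and replaced with Series.where. This is a different technique from the forward-fill above, and the names `has_later_positive`, `group_max` and `result` are only illustrative.

```python
import pandas as pd

data = {"group": list("AAAAABBBBBBCCCCC"),
        "value": [3, 0, 8, 7, 0, -1, 0, 9, -2, 0, 0, 2, 0, 5, 0, 1]}
df = pd.DataFrame(data)

rev = df.iloc[::-1]  # rows in reverse order, original index kept

# 1 for positive values; the per-group running max over the reversed rows
# becomes 1 at the last positive value and stays 1 for all rows before it.
positive = (rev["value"] > 0).astype(int)
has_later_positive = (positive.groupby(rev["group"]).cummax()
                      .astype(bool).reindex(df.index))

group_max = df.groupby("group")["value"].transform("max")

# Keep values at or before the last positive one; use the group max elsewhere.
result = df["value"].where(has_later_positive, group_max)
print(result.tolist())
```

Because no NaN is introduced along the way, this avoids the detour through missing values that the forward-fill approach takes.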
