简体   繁体   中英

Perform addition calculation in a column based on another (grouping by a specific column)

I have a dataframe, df, where I would like to perform an addition based on values in a different column. The addition calculation is based on the id value.

Data

id  date    value   start   status
aa  Q1.22   hello   80  
aa  Q2.22   hi     -200     2000
aa  Q3.22   hey    -400 
aa  Q4.22   sure   -500 
bb  Q3.23   ok      30  
bb  Q4.23   sure    10      100

Desired

id  date    value   start   status  update
aa  Q1.22   hello   80      
aa  Q2.22   hi     -200     2000    1800
aa  Q3.22   hey    -400             1400
aa  Q4.22   sure   -500             900
bb  Q3.23   ok      30      
bb  Q4.23   sure    10      100     110

Logic

The addition will begin at the first display of a single value. For example we see that the number 2000 is a single non consecutive value. The start point of the addition calculation will begin here. Similar to the 100 value (it is an ending value)

We are just adding the values in the 'start' column to the value in the 'status' column. The adding ends with the id. The adding calculation is grouped by id.

Doing

df = df.groupby(["id"]).sum()

df['update'] = df.groupby(["id"]).sum()['start']

However,

I am still researching this, any suggestion is appreciated.

  1. to_numeric to convert status to numeric with NaN values
  2. groupby ffill the status column this will fill move numeric values forward (leaving starting NaN as NaN
  3. Series.mask to remove unwanted values from start then groupby cumsum
  4. add to the filled status columns.
df['status'] = pd.to_numeric(df['status'], errors='coerce')
status = df.groupby('id')['status'].ffill()
df['update'] = (
        status + df['start'].mask(status.isna()).groupby(df['id']).cumsum()
)

df :

   id   date  value  start  status  update
0  aa  Q1.22  hello     80     NaN     NaN
1  aa  Q2.22     hi   -200  2000.0  1800.0
2  aa  Q3.22    hey   -400     NaN  1400.0
3  aa  Q4.22   sure   -500     NaN   900.0
4  bb  Q3.23     ok     30     NaN     NaN
5  bb  Q4.23   sure     10   100.0   110.0

Optionally, DataFrame.fillna to add back the empty strings:

df = df.fillna('')

df :

   id   date  value  start  status  update
0  aa  Q1.22  hello     80                
1  aa  Q2.22     hi   -200  2000.0  1800.0
2  aa  Q3.22    hey   -400          1400.0
3  aa  Q4.22   sure   -500           900.0
4  bb  Q3.23     ok     30                
5  bb  Q4.23   sure     10   100.0   110.0

Breakdown of Steps:

(Step 2 Forward Fill status in groups)

df.groupby('id')['status'].ffill()

0       NaN
1    2000.0
2    2000.0
3    2000.0
4       NaN
5     100.0
Name: status, dtype: float64

(Step 3.1 Remove unwanted values from start )

df['start'].mask(status.isna())

0      NaN
1   -200.0
2   -400.0
3   -500.0
4      NaN
5     10.0
Name: start, dtype: float64

(Step 3.2 cumsum in groups the kept values)

df['start'].mask(status.isna()).groupby(df['id']).cumsum()

0       NaN
1    -200.0
2    -600.0
3   -1100.0
4       NaN
5      10.0
Name: start, dtype: float64

(Step 4 add to forward filled status )

status + df['start'].mask(status.isna()).groupby(df['id']).cumsum()

0       NaN
1    1800.0
2    1400.0
3     900.0
4       NaN
5     110.0
dtype: float64

Try something new

idx = df.groupby('id').status.transform('idxmax')
df['new'] = df.status.add(df.start,fill_value=0)[df.index>=idx].groupby(df['id']).cumsum()
df
Out[270]: 
   id   date  value  start  status     new
0  aa  Q1.22  hello     80     NaN     NaN
1  aa  Q2.22     hi   -200  2000.0  1800.0
2  aa  Q3.22    hey   -400     NaN  1400.0
3  aa  Q4.22   sure   -500     NaN   900.0
4  bb  Q3.23     ok     30     NaN     NaN
5  bb  Q4.23   sure     10   100.0   110.0

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM