I have the dataframe below:
id action
================
10 CREATED
10 111
10 222
10 333
10 DONE
10 222
10 UPDATED
777 CREATED
10 333
10 DONE
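I would like to create a new column "check" that is based on data in previous rows of the dataframe: each DONE row should get the first letter of the most recent CREATED or UPDATED action with the same id.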
Output:
id action check
================
10 CREATED
10 111
10 222
10 333
10 DONE C
10 222
10 UPDATED
777 CREATED
10 333
10 DONE U
I tried to use multiple if conditions, but it did not work for me. Can you please help?
Consider a more sophisticated sample dataframe for illustration:
# print(df)
id action
10 CREATED
10 111
10 222
10 333
10 DONE
10 222
10 UPDATED
777 CREATED
10 333
10 DONE
777 DONE
10 CREATED
10 DONE
11 UPDATED
11 DONE
Use:
transformer = lambda s: s[(s.eq('CREATED') | s.eq('UPDATED')).cumsum().idxmax()]
grouper = (
    lambda g: g.groupby(
        g['action'].eq('DONE').cumsum().shift().fillna(0))['action']
    .transform(transformer)
)
df['check'] = df.groupby('id').apply(grouper).droplevel(0).str[0]
df.loc[df['action'].ne('DONE'), 'check'] = ''
Explanation:
First we group the dataframe on id and apply the grouper function. Within each id group, we group again by a running count of the DONE values in the action column (shifted down by one row), so essentially we split the group into multiple segments, each of which ends at a DONE row. Then we use the transformer lambda function to transform each of these segments to the most recent value (CREATED or UPDATED) that precedes the DONE value in the action column.
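To make the segmentation step concrete, here is a minimal sketch that computes the segment key for a single id. It assumes the ten-row dataframe from the question, and the names g, segment and segments are only illustrative.

```python
import pandas as pd

# Sample data from the question (assumed here for illustration).
df = pd.DataFrame({
    "id": [10, 10, 10, 10, 10, 10, 10, 777, 10, 10],
    "action": ["CREATED", "111", "222", "333", "DONE",
               "222", "UPDATED", "CREATED", "333", "DONE"],
})

# Rows for one id, in their original order.
g = df[df["id"] == 10]

# Running count of DONE rows, shifted down by one row so that each
# DONE row stays in the segment it closes.
segment = g["action"].eq("DONE").cumsum().shift().fillna(0).astype(int)
segments = segment.tolist()
print(segments)
```

Rows that share a segment number are handed to transformer together, so each DONE row only sees the actions recorded since the previous DONE for that id.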
Result:
# print(df)
id action check
0 10 CREATED
1 10 111
2 10 222
3 10 333
4 10 DONE C
5 10 222
6 10 UPDATED
7 777 CREATED
8 10 333
9 10 DONE U
10 777 DONE C
11 10 CREATED
12 10 DONE C
13 11 UPDATED
14 11 DONE U
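As a possible alternative (not the method above), the same result for this sample can be sketched with where and a per-id forward fill. Note the difference in behaviour: this variant carries the last CREATED/UPDATED forward across DONE rows instead of resetting after each DONE, so it is only equivalent when every DONE is preceded by a fresh CREATED or UPDATED. The names marker and last_marker are illustrative.

```python
import pandas as pd

# The larger sample dataframe used in this answer.
df = pd.DataFrame({
    "id": [10, 10, 10, 10, 10, 10, 10, 777, 10, 10, 777, 10, 10, 11, 11],
    "action": ["CREATED", "111", "222", "333", "DONE", "222", "UPDATED",
               "CREATED", "333", "DONE", "DONE", "CREATED", "DONE",
               "UPDATED", "DONE"],
})

# Keep only CREATED/UPDATED values; everything else becomes NaN.
marker = df["action"].where(df["action"].isin(["CREATED", "UPDATED"]))

# Within each id, carry the most recent marker forward.
last_marker = marker.groupby(df["id"]).ffill()

# First letter of the marker on DONE rows, empty string elsewhere
# (and also when a DONE row has no earlier marker for its id).
df["check"] = last_marker.str[0].where(df["action"].eq("DONE"), "").fillna("")
print(df)
```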
I don't know whether it's the best answer, but I tried to create my own logic to solve this problem.
1) Get the index of rows where the action is done:
m = df.groupby(['id'])['action'].transform(list).eq('DONE')
idx = df[m].index.values.tolist()
df[m]:
id action
4 10 DONE
9 10 DONE
idx:
[4, 9]
2) Group by id and get the index of all rows where the action is either CREATED or UPDATED:
n = df.groupby(['id'])['action'].transform(list).str.contains('CREATED|UPDATED', case=False)
n_idx = df[n].index
df[n]:
id action
0 10 CREATED
6 10 UPDATED
7 777 CREATED
n_idx:
Int64Index([0, 6, 7], dtype='int64')
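For comparison, the two index lists from steps 1 and 2 can also be sketched with plain boolean masks, without the groupby. This is an assumption-level simplification that relies on the ten-row dataframe from the question; done_idx and marker_idx are illustrative names.

```python
import pandas as pd

# Sample data from the question (assumed here for illustration).
df = pd.DataFrame({
    "id": [10, 10, 10, 10, 10, 10, 10, 777, 10, 10],
    "action": ["CREATED", "111", "222", "333", "DONE",
               "222", "UPDATED", "CREATED", "333", "DONE"],
})

# Index labels of the DONE rows.
done_idx = df.index[df["action"].eq("DONE")].tolist()

# Index labels of the CREATED/UPDATED rows.
marker_idx = df.index[df["action"].isin(["CREATED", "UPDATED"])].tolist()

print(done_idx, marker_idx)
```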
3) Fill new column "check" with empty string:
df['check'] = ''
4) Now you have 2 indexes: one for DONE and another for CREATED/UPDATED. For each DONE row, check whether any of the previous rows is a CREATED/UPDATED row, keeping in mind that it must have the same id.
ix = [0] + idx  # <-- [0, 4, 9]
for a in list(zip(ix, ix[1:])):  # <--- creates the pairs (0, 4), (4, 9)
    for j in n_idx:
        if j in range(a[0], a[1]):  # <--- check whether a CREATED/UPDATED index falls in this range (the previous rows)
            if df.iloc[a[1]].id == df.iloc[j].id:  # <-- check for id
                df.loc[a[1], 'check'] = df.loc[j, 'action'][0]  # <--- assign the first letter of the action
                break  # <--- stop at the first match
Final Output:
df:
id action check
0 10 CREATED
1 10 111
2 10 222
3 10 333
4 10 DONE C
5 10 222
6 10 UPDATED
7 777 CREATED
8 10 333
9 10 DONE U
FULL CODE:
m = df.groupby(['id'])['action'].transform(list).eq('DONE')
idx = df[m].index.values.tolist()
n = df.groupby(['id'])['action'].transform(list).str.contains('CREATED|UPDATED', case=False)
n_idx = df[n].index
ix = [0] + idx
df['check'] = ''
for a in list(zip(ix, ix[1:])):
    for j in n_idx:
        if (j in range(a[0], a[1]+1)) and (df.iloc[a[1]].id == df.iloc[j].id):
            df.loc[a[1], 'check'] = df.loc[j, 'action'][0]
            break
id action check
0 10 CREATED
1 10 111
2 10 DONE C
3 10 333
4 10 DONE
5 10 222
6 10 UPDATED
7 777 CREATED
8 777 DONE C
9 10 DONE
id action check
0 10 CREATED
1 10 111
2 10 DONE C
3 10 333
4 777 UPDATED
5 10 222
6 10 UPDATED
7 777 CREATED
8 777 DONE U
9 10 DONE
A loopy solution, not optimal but does the job.
This assumes that the rows in your dataframe are ordered in time, and that you have a dataframe with 2 columns ['id', 'action'] and an integer index = range(N), where N is the number of rows. Then:
df['check'] = ''
for i, action in zip(df.index, df['action']):
    if action == 'DONE':
        action_id = df.loc[i, 'id']
        prev_action = df.iloc[:i].loc[(df['id'] == action_id) &
                                      (df['action'].isin(['CREATED', 'UPDATED'])), 'action'].iloc[-1]
        if prev_action == 'CREATED':
            df.loc[i, 'check'] = 'C'
        elif prev_action == 'UPDATED':
            df.loc[i, 'check'] = 'U'
Basically we loop through the actions and find the cases where df['action'] == 'DONE', then get the id associated with that action and look at the history of actions for this id prior to the current 'DONE' event by calling df.iloc[:i]. Then we narrow this history down to the actions that belong to ['CREATED', 'UPDATED'] and look at the last action in that list, based on which we assign the value in the 'check' column.
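The same idea can be sketched as a single pass that remembers the last CREATED/UPDATED seen for each id, which avoids re-scanning the history on every DONE row. This is only an illustrative variant, assuming the ten-row dataframe from the question; last_marker and checks are made-up names.

```python
import pandas as pd

# Sample data from the question (assumed here for illustration).
df = pd.DataFrame({
    "id": [10, 10, 10, 10, 10, 10, 10, 777, 10, 10],
    "action": ["CREATED", "111", "222", "333", "DONE",
               "222", "UPDATED", "CREATED", "333", "DONE"],
})

last_marker = {}  # id -> most recent CREATED/UPDATED action seen so far
checks = []

for row_id, action in zip(df["id"], df["action"]):
    if action in ("CREATED", "UPDATED"):
        last_marker[row_id] = action
    if action == "DONE":
        # First letter of the last marker, or '' if this id has none yet.
        checks.append(last_marker.get(row_id, "")[:1])
    else:
        checks.append("")

df["check"] = checks
print(df)
```

Unlike the loop above, a DONE row with no earlier CREATED or UPDATED for its id gets an empty string here instead of raising an IndexError.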