pandas dataframe column based on previous rows

Question

I have the dataframe below:

         id  action   
         ================
         10   CREATED   
         10   111
         10   222
         10   333
         10   DONE      
         10   222
         10   UPDATED   
         777  CREATED    
         10   333
         10   DONE

I would like to create a new column "check" based on data in the previous rows of the dataframe:

Find each cell in the action column equal to "DONE".
Search the previous rows for the nearest CREATED or UPDATED with the same id, that is, the first one encountered when going back from DONE. If it is CREATED, put C; if it is UPDATED, put U.

Output:

         id  action   check
         ================
         10   CREATED   
         10   111
         10   222
         10   333
         10   DONE      C
         10   222
         10   UPDATED   
         777  CREATED    
         10   333
         10   DONE      U

I tried using multiple if conditions, but it did not work for me. Could you please help?

Answer 1

Consider a more sophisticated sample dataframe for illustration:

# print(df)
id  action   
10   CREATED   
10   111
10   222
10   333
10   DONE      
10   222
10   UPDATED   
777  CREATED    
10   333
10   DONE
777  DONE
10   CREATED
10   DONE
11   UPDATED
11   DONE
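
To reproduce the result, the sample frame above can be built roughly as follows. This is a minimal sketch; it assumes the numeric-looking actions (111, 222, 333) are stored as strings, which the post does not state.

```python
import pandas as pd

# Sample frame from the listing above; numeric-looking actions are
# assumed to be strings so that the whole column has one type.
df = pd.DataFrame({
    'id': [10, 10, 10, 10, 10, 10, 10, 777, 10, 10, 777, 10, 10, 11, 11],
    'action': ['CREATED', '111', '222', '333', 'DONE', '222', 'UPDATED',
               'CREATED', '333', 'DONE', 'DONE', 'CREATED', 'DONE',
               'UPDATED', 'DONE'],
})
print(df)
```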

Use:

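# Within one segment, pick the last CREATED/UPDATED: the cumulative count of
# such rows first reaches its maximum at the last one, and idxmax returns that label.
transformer = lambda s: s[(s.eq('CREATED') | s.eq('UPDATED')).cumsum().idxmax()]

# Within one id, split the rows into segments that each end at a DONE row,
# then broadcast the selected action over the whole segment.
grouper = (
    lambda g: g.groupby(
        g['action'].eq('DONE').cumsum().shift().fillna(0))['action']
    .transform(transformer)
)

# Keep only the first letter (C or U) and blank out every non-DONE row.
df['check'] = df.groupby('id').apply(grouper).droplevel(0).str[0]
df.loc[df['action'].ne('DONE'), 'check'] = ''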

Explanation:

First we group the dataframe on id and apply the grouper function to each group. Inside each group we group again, this time by a running count of the DONE values in the action column, so the group is split into segments that each end at a DONE row. The transformer lambda then fills each segment with the last CREATED or UPDATED value that precedes its DONE row, and finally we keep only the first letter of that value on the DONE rows.
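
To make the segmentation step more concrete, here is a small sketch on the actions of a single id. The variable names actions, segment and filled are made up for illustration; the transformer is the one from the answer.

```python
import pandas as pd

# Actions of one id, as they would appear inside one outer group.
actions = pd.Series(['CREATED', '111', '222', '333', 'DONE',
                     '222', 'UPDATED', '333', 'DONE'])

# Running count of DONE rows, shifted down by one row so that each DONE
# row stays in the segment it closes.
segment = actions.eq('DONE').cumsum().shift().fillna(0)
print(segment.tolist())

# Same transformer as in the answer: the last CREATED/UPDATED of a segment.
transformer = lambda s: s[(s.eq('CREATED') | s.eq('UPDATED')).cumsum().idxmax()]

# Broadcast the selected action over every row of its segment.
filled = actions.groupby(segment).transform(transformer)
print(filled.tolist())
```

Note that a segment containing no CREATED or UPDATED at all would get its first action instead, so this relies on every DONE being preceded by one of them within its segment.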

Result:

# print(df)
     id   action check
0    10  CREATED      
1    10      111      
2    10      222      
3    10      333      
4    10     DONE     C
5    10      222      
6    10  UPDATED      
7   777  CREATED      
8    10      333      
9    10     DONE     U
10  777     DONE     C
11   10  CREATED      
12   10     DONE     C
13   11  UPDATED      
14   11     DONE     U

Answer 2

I don't know whether it's the best answer, but I tried to create my own logic to solve this problem.

1) Get the index of the rows where the action is DONE:

m = df.groupby(['id'])['action'].transform(list).eq('DONE')
idx = df[m].index.values.tolist()

df[m]:

    id  action
4   10  DONE
9   10  DONE

idx:

[4, 9]
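
Since the mask only looks at the action column, the same indexes can probably be obtained without the groupby. A minimal sketch on the question's sample data, assuming the numeric-looking actions are strings:

```python
import pandas as pd

df = pd.DataFrame({
    'id': [10, 10, 10, 10, 10, 10, 10, 777, 10, 10],
    'action': ['CREATED', '111', '222', '333', 'DONE',
               '222', 'UPDATED', 'CREATED', '333', 'DONE'],
})

# Boolean mask of DONE rows, then the index labels where it is True.
m = df['action'].eq('DONE')
idx = df.index[m].tolist()
print(idx)
```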

2) Get the index of all rows where the action is either CREATED or UPDATED:

n = df.groupby(['id'])['action'].transform(list).str.contains('CREATED|UPDATED', case=False)

n_idx = df[n].index

df[n]:

    id  action
0   10  CREATED
6   10  UPDATED
7   777 CREATED

n_idx:

Int64Index([0, 6, 7], dtype='int64')

3) Fill new column "check" with empty string:

df['check'] = ''

4) You now have two sets of indexes: one for DONE and one for CREATED/UPDATED. For each DONE row, check whether the rows before it contain a CREATED/UPDATED with the same id.

ix = [0] + idx  # [0, 4, 9]
for a in list(zip(ix, ix[1:])):  # windows (0, 4) and (4, 9)
    for j in n_idx:
        if j in range(a[0], a[1]):  # CREATED/UPDATED index falls in the window before this DONE
            if df.iloc[a[1]].id == df.iloc[j].id:  # same id as the DONE row
                df.loc[a[1], 'check'] = df.loc[j, 'action'][0]  # first letter: C or U
                break  # stop at the first match
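
To see what the windows look like before any id check is applied, here is a small sketch using the index lists from steps 1 and 2. The names windows and per_done are only for illustration.

```python
idx = [4, 9]       # DONE rows from step 1
n_idx = [0, 6, 7]  # CREATED/UPDATED rows from step 2
ix = [0] + idx     # prepend the start of the frame

# Pair each boundary with the next one to get (start, end) windows.
windows = list(zip(ix, ix[1:]))
print(windows)

# CREATED/UPDATED candidates that fall inside the window before each DONE row.
per_done = {end: [j for j in n_idx if start <= j < end]
            for start, end in windows}
print(per_done)
```

The id comparison in the loop is what then discards candidates such as row 7, which belongs to id 777.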

Final Output:

df:

    id  action  check
0   10  CREATED 
1   10  111 
2   10  222 
3   10  333 
4   10  DONE    C
5   10  222 
6   10  UPDATED 
7   777 CREATED 
8   10  333 
9   10  DONE    U

FULL CODE:

m = df.groupby(['id'])['action'].transform(list).eq('DONE')
idx = df[m].index.values.tolist()
n = df.groupby(['id'])['action'].transform(list).str.contains('CREATED|UPDATED', case=False)
n_idx = df[n].index
ix = [0] + idx
df['check'] = ''

for a in list(zip(ix, ix[1:])):
    for j in (n_idx):
        if (j in range(a[0], a[1]+1)) and (df.iloc[a[1]].id==df.iloc[j].id):
            df.loc[a[1],'check'] = df.loc[j,'action'][0]
            break

Sample Data with result:

    id  action  check
0   10  CREATED 
1   10  111 
2   10  DONE    C
3   10  333 
4   10  DONE    
5   10  222 
6   10  UPDATED 
7   777 CREATED 
8   777 DONE    C
9   10  DONE

    id  action  check
0   10  CREATED 
1   10  111 
2   10  DONE    C
3   10  333 
4   777 UPDATED 
5   10  222 
6   10  UPDATED 
7   777 CREATED 
8   777 DONE    U
9   10  DONE

Answer 3

A loopy solution; not optimal, but it does the job.

This assumes that the rows in your dataframe are ordered in time, and that you have a dataframe with 2 columns ['id', 'action'] and an integer index = range(N), where N is the number of rows. Then:

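df['check'] = ''
for i, action in zip(df.index, df['action']):
    if action == 'DONE':
        action_id = df.loc[i, 'id']
        # Rows before the current DONE row (relies on index = range(N))
        history = df.iloc[:i]
        prev_actions = history.loc[(history['id'] == action_id) &
                                   (history['action'].isin(['CREATED', 'UPDATED'])), 'action']
        if prev_actions.empty:
            continue  # no earlier CREATED/UPDATED for this id, leave check empty
        prev_action = prev_actions.iloc[-1]  # most recent one
        if prev_action == 'CREATED':
            df.loc[i, 'check'] = 'C'
        elif prev_action == 'UPDATED':
            df.loc[i, 'check'] = 'U'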

Basically, we loop through the actions and find the rows where df['action'] == 'DONE'. For each one we get the id associated with the action and look at the history of actions for that id before the current 'DONE' event by calling df.iloc[:i]. We then narrow this history down to the actions in ['CREATED', 'UPDATED'] and look at the last one, which determines the value assigned to the 'check' column.
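
If the loop gets slow on a large frame, the same "most recent CREATED/UPDATED for this id" lookup can be sketched without an explicit loop by forward-filling per id. This is a different technique from the loop above, not part of the original answer; it also assumes the rows are ordered in time, and the names marker and last_marker are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'id': [10, 10, 10, 10, 10, 10, 10, 777, 10, 10],
    'action': ['CREATED', '111', '222', '333', 'DONE',
               '222', 'UPDATED', 'CREATED', '333', 'DONE'],
})

# Keep only CREATED/UPDATED values; every other row becomes NaN.
marker = df['action'].where(df['action'].isin(['CREATED', 'UPDATED']))

# Carry the most recent marker forward within each id.
last_marker = marker.groupby(df['id']).ffill()

# First letter of that marker on DONE rows, empty string everywhere else.
is_done = df['action'].eq('DONE')
df['check'] = last_marker.str[0].where(is_done).fillna('')
print(df)
```

Like the loop, this looks at the whole earlier history of an id, so a second DONE without a new CREATED/UPDATED in between would reuse the previous marker.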

3 answers

solution1: 1, ACCEPTED, 2020-06-12 18:04:42
solution2: 0, 2020-06-12 17:52:04
solution3: 0, 2020-06-12 17:52:47
