
pandas dataframe column based on previous rows

I have the dataframe below:

         id  action   
         ================
         10   CREATED   
         10   111
         10   222
         10   333
         10   DONE      
         10   222
         10   UPDATED   
         777  CREATED    
         10   333
         10   DONE      

I would like to create a new column "check" based on data in previous rows of the dataframe:

  1. Find each cell in the action column equal to "DONE".
  2. Search backwards through the previous rows for the first CREATED or UPDATED with the same id (that is, the most recent one before DONE). If it is CREATED, put C; if it is UPDATED, put U.

Output:

         id  action   check
         ================
         10   CREATED   
         10   111
         10   222
         10   333
         10   DONE      C
         10   222
         10   UPDATED   
         777  CREATED    
         10   333
         10   DONE      U

I tried to use multiple if conditions, but it did not work for me. Can you please help?

Consider a more sophisticated sample dataframe for illustration:

# print(df)
id  action   
10   CREATED   
10   111
10   222
10   333
10   DONE      
10   222
10   UPDATED   
777  CREATED    
10   333
10   DONE
777  DONE
10   CREATED
10   DONE
11   UPDATED
11   DONE     

Use:

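# Within one DONE-delimited segment, pick the last CREATED/UPDATED value:
# idxmax() returns the first label where the running count peaks, which is
# the label of the last CREATED/UPDATED in that segment.
transformer = lambda s: s[(s.eq('CREATED') | s.eq('UPDATED')).cumsum().idxmax()]

# Within one id, start a new segment right after every DONE row and
# broadcast the picked value to every row of the segment.
grouper = (
    lambda g: g.groupby(
        g['action'].eq('DONE').cumsum().shift().fillna(0))['action']
    .transform(transformer)
)

# Keep only the first letter ('C' or 'U'), then blank out all non-DONE rows.
df['check'] = df.groupby('id').apply(grouper).droplevel(0).str[0]
df.loc[df['action'].ne('DONE'), 'check'] = ''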

Explanation:

First we group the dataframe on id and apply the grouper function. Within each id group, we group again by a running count of the DONE values in the action column, so the group is split into segments, each one ending at a DONE row. The transformer lambda then replaces every value in a segment with the last CREATED or UPDATED value in that segment, which is the one that most recently precedes DONE. Finally, we keep only the first letter of that value and blank out every row whose action is not DONE.
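To make the segmentation step concrete, here is a minimal sketch on the actions of a single id. The sample values and the variable names (`actions`, `segment`, `picked`) are made up for illustration; only the `eq('DONE').cumsum().shift().fillna(0)` expression and the `cumsum().idxmax()` trick come from the answer.

```python
import pandas as pd

# Hypothetical actions for one id, in time order.
actions = pd.Series(['CREATED', '111', 'DONE', '222', 'UPDATED', '333', 'DONE'])

# Running count of DONE rows, shifted down by one so that each DONE row
# stays in the segment it closes instead of opening the next one.
segment = actions.eq('DONE').cumsum().shift().fillna(0).astype(int)

# For every segment, broadcast the last CREATED/UPDATED value it contains.
picked = actions.groupby(segment).transform(
    lambda s: s.loc[s.isin(['CREATED', 'UPDATED']).cumsum().idxmax()]
)

print(pd.DataFrame({'action': actions, 'segment': segment, 'picked': picked}))
```

Note that if a segment contains no CREATED or UPDATED at all, the running count stays at zero and `idxmax()` falls back to the first row of the segment, so such segments would need separate handling.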

Result:

# print(df)
     id   action check
0    10  CREATED      
1    10      111      
2    10      222      
3    10      333      
4    10     DONE     C
5    10      222      
6    10  UPDATED      
7   777  CREATED      
8    10      333      
9    10     DONE     U
10  777     DONE     C
11   10  CREATED      
12   10     DONE     C
13   11  UPDATED      
14   11     DONE     U

I don't know whether it's the best answer, but I tried to create my own logic to solve this problem.

1) Get the index of the rows where the action is DONE:

m = df.groupby(['id'])['action'].transform(list).eq('DONE')
idx = df[m].index.values.tolist()

df[m]:

    id  action
4   10  DONE
9   10  DONE

idx:

[4, 9]

2) Group by id and get the index of all the rows where the action is either CREATED or UPDATED:

n = df.groupby(['id'])['action'].transform(list).str.contains('CREATED|UPDATED', case=False)

n_idx = df[n].index

df[n]:

    id  action
0   10  CREATED
6   10  UPDATED
7   777 CREATED

n_idx:

Int64Index([0, 6, 7], dtype='int64')

3) Fill the new column "check" with empty strings:

df['check'] = ''

4) Now you have two indexes: one for DONE and another for CREATED/UPDATED. You have to check whether the rows before each DONE contain a CREATED/UPDATED, keeping in mind that they must have the same id.

ix = [0] + idx # <-- [0, 4, 9]
for a in list(zip(ix, ix[1:])): # <--- will create range (0,4), (4,9)
    for j in (n_idx):
        if j in range(a[0], a[1]): # <--- compare if CREATED/UPDATED indexes fall in this range. (checking previous row) and break if get any of them
            if (df.iloc[a[1]].id==df.iloc[j].id): # <--  check for id
                df.loc[a[1],'check'] = df.loc[j,'action'][0] # <--- assign Action
                break
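The window logic in the loop above can be traced without pandas. The following is a small pure-Python sketch, not the answer's code: the lists `ids` and `actions` are copied from the question's example, and the names `done_idx`, `marker_idx`, `windows` and `check` are made up for illustration.

```python
# Values copied from the question's example dataframe.
ids = [10, 10, 10, 10, 10, 10, 10, 777, 10, 10]
actions = ['CREATED', '111', '222', '333', 'DONE',
           '222', 'UPDATED', 'CREATED', '333', 'DONE']

# Positions of DONE rows and of CREATED/UPDATED rows.
done_idx = [i for i, a in enumerate(actions) if a == 'DONE']
marker_idx = [i for i, a in enumerate(actions) if a in ('CREATED', 'UPDATED')]

# Consecutive pairs: each window runs from the previous DONE (or 0) to the next DONE.
ix = [0] + done_idx
windows = list(zip(ix, ix[1:]))

check = [''] * len(actions)
for start, done in windows:
    for j in marker_idx:
        # Take the first CREATED/UPDATED inside the window that has the same id.
        if start <= j < done and ids[j] == ids[done]:
            check[done] = actions[j][0]
            break

print(windows)
print(check)
```

Each window starts at the previous DONE row regardless of its id, so a CREATED/UPDATED that lies before an unrelated DONE is not considered for later DONE rows.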

Final output:

df:

    id  action  check
0   10  CREATED 
1   10  111 
2   10  222 
3   10  333 
4   10  DONE    C
5   10  222 
6   10  UPDATED 
7   777 CREATED 
8   10  333 
9   10  DONE    U

Full code:

m = df.groupby(['id'])['action'].transform(list).eq('DONE')
idx = df[m].index.values.tolist()
n = df.groupby(['id'])['action'].transform(list).str.contains('CREATED|UPDATED', case=False)
n_idx = df[n].index
ix = [0] + idx
df['check'] = ''

for a in list(zip(ix, ix[1:])):
    for j in (n_idx):
        if (j in range(a[0], a[1]+1)) and (df.iloc[a[1]].id==df.iloc[j].id):
            df.loc[a[1],'check'] = df.loc[j,'action'][0]
            break

Sample data with result:

    id  action  check
0   10  CREATED 
1   10  111 
2   10  DONE    C
3   10  333 
4   10  DONE    
5   10  222 
6   10  UPDATED 
7   777 CREATED 
8   777 DONE    C
9   10  DONE    

    id  action  check
0   10  CREATED 
1   10  111 
2   10  DONE    C
3   10  333 
4   777 UPDATED 
5   10  222 
6   10  UPDATED 
7   777 CREATED 
8   777 DONE    U
9   10  DONE    

A loopy solution, not optimal, but it does the job.

This assumes that the rows in your dataframe are ordered in time, and that you have a dataframe with 2 columns ['id', 'action'] and an integer index equal to range(N), where N is the number of rows. Then:

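df['check'] = ''
for i, action in zip(df.index, df['action']):
    if action == 'DONE':
        action_id = df.loc[i, 'id']
        # Rows before the current DONE row (relies on the index being range(N))
        history = df.iloc[:i]
        prev_actions = history.loc[(history['id'] == action_id) &
                                   history['action'].isin(['CREATED', 'UPDATED']), 'action']
        if prev_actions.empty:
            continue  # no earlier CREATED/UPDATED for this id, leave check blank
        prev_action = prev_actions.iloc[-1]
        if prev_action == 'CREATED':
            df.loc[i, 'check'] = 'C'
        elif prev_action == 'UPDATED':
            df.loc[i, 'check'] = 'U'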

Basically, we loop through the actions and find the cases where df['action'] == 'DONE'. We then get the id associated with that action and look at the history of actions for this id prior to the current 'DONE' event by calling df.iloc[:i]. Then we narrow this history down to the actions that belong to ['CREATED', 'UPDATED'] and look at the last action in that list, based on which we assign the value to the 'check' column.
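The same "last CREATED/UPDATED for this id" idea can also be sketched without an explicit loop, by forward-filling within each id. This is a different technique from the loop above, shown only as a hedged alternative; the names `is_marker`, `is_done` and `last_marker` are made up for illustration, and the data is the more sophisticated sample from the first answer.

```python
import pandas as pd

df = pd.DataFrame({
    'id': [10, 10, 10, 10, 10, 10, 10, 777, 10, 10, 777, 10, 10, 11, 11],
    'action': ['CREATED', '111', '222', '333', 'DONE', '222', 'UPDATED',
               'CREATED', '333', 'DONE', 'DONE', 'CREATED', 'DONE',
               'UPDATED', 'DONE'],
})

is_marker = df['action'].isin(['CREATED', 'UPDATED'])
is_done = df['action'].eq('DONE')

# Keep only CREATED/UPDATED values, then carry the latest one forward within each id.
last_marker = df['action'].where(is_marker).groupby(df['id']).ffill()

# First letter on DONE rows, empty string everywhere else
# (including DONE rows with no earlier CREATED/UPDATED for their id).
df['check'] = last_marker.str[0].where(is_done, '').fillna('')

print(df)
```

Like the loop, this assumes the rows are ordered in time. It does not reset after a DONE row, so two consecutive DONE rows for the same id would both receive the letter of the same earlier CREATED/UPDATED.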
