
How to select groups of rows from a DataFrame if all rows follow a sequence

I'm currently working on a DataFrame that holds processes (identified by ID) that may or may not reach the end of the process. The end of the process is defined as the activity with index=6. What I need to do is keep only the processes (IDs) that are completed, meaning all 6 activities are done (so the process has activities with index equal to 1, 2, 3, 4, 5 and 6, in this specific order).

The DataFrame is structured as follows:

ID          A  index           
1   activity1      1 
1   activity2      2    
1   activity3      3    
1   activity4      4    
1   activity5      5    
1   activity6      6    
2   activity7      1    
2   activity8      2    
2   activity9      3    
3   activity10     1    
3   activity11     2    
3   activity12     3  
3   activity13     4    
3   activity14     5    
3   activity15     6    

And the resulting dataframe should be:

ID          A   index           
1   activity1      1 
1   activity2      2    
1   activity3      3    
1   activity4      4    
1   activity5      5    
1   activity6      6    
3   activity10     1    
3   activity11     2    
3   activity12     3  
3   activity13     4    
3   activity14     5    
3   activity15     6    

I've tried to do this with sum(): I created a new column 'a' and used gt() to check whether the sum of every group was greater than 20 (that is, keeping groups whose sum is at least 21, which is the sum of 1, 2, 3, 4, 5 and 6).

df['a'] = df['index'].groupby(df['index']).sum()
df2 = df[df['a'].gt(20)] 

This probably isn't the best approach, so other approaches are more than welcome. Any idea how to select rows based on this condition?

This may not be the fastest method, especially on a large DataFrame, but it does the job:

df = df.loc[df.groupby(['ID'])['index'].transform(lambda x: list(x)==list(range(1,7)))]

Or this other variation:

df = df.loc[df.groupby('ID')['index'].filter(lambda x: list(x)==list(range(1,7))).index]

Output:

    ID           A  index
0    1   activity1      1
1    1   activity2      2
2    1   activity3      3
3    1   activity4      4
4    1   activity5      5
5    1   activity6      6
9    3  activity10      1
10   3  activity11      2
11   3  activity12      3
12   3  activity13      4
13   3  activity14      5
14   3  activity15      6
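
Since both versions above call a Python lambda once per group, a vectorized variant may scale better on a large DataFrame. The following is only a sketch, not a benchmarked replacement: it assumes the rows of each ID are already in process order, and the names `pos`, `matches`, `size` and `complete` are made up for illustration.

```python
import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    "ID": [1] * 6 + [2] * 3 + [3] * 6,
    "A": [f"activity{i}" for i in range(1, 16)],
    "index": [1, 2, 3, 4, 5, 6, 1, 2, 3, 1, 2, 3, 4, 5, 6],
})

# 1-based position of each row inside its ID group
pos = df.groupby("ID").cumcount() + 1

# True for every row of a group whose 'index' values all equal their position
matches = df["index"].eq(pos).groupby(df["ID"]).transform("all")

# Number of rows in each row's group, broadcast back to the rows
size = df.groupby("ID")["index"].transform("size")

# Complete processes: exactly 6 rows, numbered 1..6 in order
complete = df[matches & size.eq(6)]
print(complete)
```

On the sample data this keeps IDs 1 and 3 and drops ID 2, the same rows as the output above.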

Another possible solution:

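# Keep only the IDs with exactly six rows whose 'index' values are 1..6 in order.
# The length check runs first, so eq() is never given a list of a different length.
out = (df.groupby('ID')
         .filter(lambda g: (len(g['index']) == 6) and
                           (g['index'].eq([*range(1,7)]).all())))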

print(out)

   ID           A  index
0    1   activity1      1
1    1   activity2      2
2    1   activity3      3
3    1   activity4      4
4    1   activity5      5
5    1   activity6      6
9    3  activity10      1
10   3  activity11      2
11   3  activity12      3
12   3  activity13      4
13   3  activity14      5
14   3  activity15      6
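
A note on why the length check in this filter matters: comparing a Series against a list of a different length with eq() raises an error instead of returning False, so the `len(...) == 6` test has to short-circuit first. A small sketch of that behaviour follows; the names `short`, `expected`, `raised` and `same` are illustrative only.

```python
import pandas as pd

short = pd.Series([1, 2, 3])      # an incomplete process, like ID 2
expected = list(range(1, 7))      # [1, 2, 3, 4, 5, 6]

# Series.eq() with a list of a different length raises ValueError
try:
    short.eq(expected)
    raised = False
except ValueError:
    raised = True

# A plain list comparison just evaluates to False on a length mismatch,
# which is why the list(x) == list(range(1, 7)) variants need no length check
same = short.tolist() == expected

print(raised, same)
```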

