
How to select groups of rows from a dataframe if all rows follow a sequence

I'm currently working on a dataframe of processes (identified by ID) that may or may not reach the end of the process. The end of the process is defined as the activity with index=6. I need to keep only the processes (IDs) that are completed, meaning all 6 activities are done, so the process contains activities with index equal to 1, 2, 3, 4, 5 and 6, in this specific order.

The dataframe is structured as follows:

ID          A  index           
1   activity1      1 
1   activity2      2    
1   activity3      3    
1   activity4      4    
1   activity5      5    
1   activity6      6    
2   activity7      1    
2   activity8      2    
2   activity9      3    
3   activity10     1    
3   activity11     2    
3   activity12     3  
3   activity13     4    
3   activity14     5    
3   activity15     6    
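
To reproduce this, the sample frame above can be rebuilt roughly as follows. This is only a sketch: the default row labels (0 to 14) are an assumption, since the question does not show them.

```python
import pandas as pd

# Sample data from the question: ID 1 and ID 3 run through all six steps,
# ID 2 stops after step 3.
df = pd.DataFrame({
    "ID": [1] * 6 + [2] * 3 + [3] * 6,
    "A": [f"activity{i}" for i in range(1, 16)],
    "index": [1, 2, 3, 4, 5, 6, 1, 2, 3, 1, 2, 3, 4, 5, 6],
})
```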

And the resulting dataframe should be:

ID          A   index           
1   activity1      1 
1   activity2      2    
1   activity3      3    
1   activity4      4    
1   activity5      5    
1   activity6      6    
3   activity10     1    
3   activity11     2    
3   activity12     3  
3   activity13     4    
3   activity14     5    
3   activity15     6    

I tried to do this with sum(): I created a new column 'a' and used gt() to check whether the sum of each group was greater than 20 (that is, keeping groups whose sum() is at least 21, which is the sum of 1, 2, 3, 4, 5, 6).

df['a'] = df['index'].groupby(df['index']).sum()
df2 = df[df['a'].gt(20)] 

This probably isn't the best approach, so other approaches are more than welcome. Any idea how to select rows based on this condition?
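
As a side note on the sum-based attempt, here is a hedged sketch of what it might look like when grouped by ID (the attempt above groups by 'index' itself) and broadcast back with transform. It also shows why a sum of 21 alone does not prove the order. The names group_sum, by_sum and shuffled, and the process with ID 9, are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "ID": [1] * 6 + [2] * 3 + [3] * 6,
    "index": [1, 2, 3, 4, 5, 6, 1, 2, 3, 1, 2, 3, 4, 5, 6],
})

# Sum per ID, repeated on every row of that ID so it can be used as a mask.
group_sum = df.groupby("ID")["index"].transform("sum")
by_sum = df[group_sum.gt(20)]

# Hypothetical counterexample: the steps sum to 21 but are in reverse order,
# so a sum check alone would still accept this process.
shuffled = pd.DataFrame({"ID": [9] * 6, "index": [6, 5, 4, 3, 2, 1]})
passes_sum = bool(shuffled.groupby("ID")["index"].transform("sum").gt(20).all())
```

On the sample data the sum check happens to keep the right IDs, but it cannot tell an ordered sequence from a reordered one.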

This may not be the fastest method, especially on a large dataframe, but it does the job:

df = df.loc[df.groupby(['ID'])['index'].transform(lambda x: list(x)==list(range(1,7)))]

Or this other variation:

df = df.loc[df.groupby('ID')['index'].filter(lambda x: list(x)==list(range(1,7))).index]

Output:


    ID           A  index
0    1   activity1      1
1    1   activity2      2
2    1   activity3      3
3    1   activity4      4
4    1   activity5      5
5    1   activity6      6
9    3  activity10      1
10   3  activity11      2
11   3  activity12      3
12   3  activity13      4
13   3  activity14      5
14   3  activity15      6
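
Since the lambda-based versions above may be slow on a large dataframe, here is one possible vectorized variant. It is a different technique from the two snippets above, not a rewrite of them: it compares each row's index with its position inside the group (via cumcount) and also requires the group to have exactly six rows. The names n_steps, expected, in_order and full_length are illustrative, and no timing claim is made here.

```python
import pandas as pd

df = pd.DataFrame({
    "ID": [1] * 6 + [2] * 3 + [3] * 6,
    "A": [f"activity{i}" for i in range(1, 16)],
    "index": [1, 2, 3, 4, 5, 6, 1, 2, 3, 1, 2, 3, 4, 5, 6],
})

n_steps = 6  # assumed length of a completed process
grouped = df.groupby("ID")

# Position of each row inside its ID group, starting at 1.
expected = grouped.cumcount() + 1

# True on every row of a group whose index values match their positions.
in_order = df["index"].eq(expected).groupby(df["ID"]).transform("all")

# True on every row of a group that has exactly n_steps rows.
full_length = grouped["index"].transform("size").eq(n_steps)

out = df[in_order & full_length]
```

Both conditions are needed: in_order alone would also accept ID 2, whose steps 1, 2, 3 are in order but incomplete.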

Another possible solution:

out = (df.groupby('ID')
       .filter(lambda g: (len(g['index']) == 6) and 
       (g['index'].eq([*range(1,7)]).all())))

print(out)

   ID           A  index
0    1   activity1      1
1    1   activity2      2
2    1   activity3      3
3    1   activity4      4
4    1   activity5      5
5    1   activity6      6
9    3  activity10      1
10   3  activity11      2
11   3  activity12      3
12   3  activity13      4
13   3  activity14      5
14   3  activity15      6
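
The same filter can be written as a small named function if the number of steps needs to vary. This is a sketch only: is_complete and n_steps are hypothetical names, not part of the answer above. The length check comes first because Series.eq raises a ValueError when given a list of a different length.

```python
import pandas as pd

def is_complete(group, n_steps=6):
    """Return True if the group's 'index' values are exactly 1..n_steps, in order."""
    steps = group["index"]
    # Short-circuit on length so eq() never sees a list of the wrong size.
    return len(steps) == n_steps and bool(steps.eq(list(range(1, n_steps + 1))).all())

df = pd.DataFrame({
    "ID": [1] * 6 + [2] * 3 + [3] * 6,
    "A": [f"activity{i}" for i in range(1, 16)],
    "index": [1, 2, 3, 4, 5, 6, 1, 2, 3, 1, 2, 3, 4, 5, 6],
})

# Keep processes that ran through all six steps.
out = df.groupby("ID").filter(is_complete)

# Extra keyword arguments to filter() are forwarded to the function,
# so the same helper can select processes with exactly three steps.
short = df.groupby("ID").filter(is_complete, n_steps=3)
```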

