
Pandas - if condition is met in a row, add values to preceding rows without iteration

I am rather new to pandas and face a fairly complicated problem. My current solution uses many nested loops, so I wonder whether there is a faster and more "pandasic" way to do this.

I have a dataframe of events similar to this simplified version:

min  sec  isDone       sessionId
2    40   False        1
2    50   False        1
2    55   False        1
2    58   False        1
3    01   False        1
3    12   True         1
5    0    False        1
5    5    False        1
5    15   False        1
5    30   True         1
5    50   False        1
2    0    False        2
2    10   False        2
2    30   False        2
2    50   True         2

Now I want to add a column that contains the number of seconds until the next "True" in the "isDone" column, but only when that "True" is within a given number of seconds and within the same "sessionId". All other values should remain NaN.

With a limit of 20 seconds, the result would look like this:

min  sec  isDone       sessionId  secToDone
2    40   False        1          NaN
2    50   False        1          NaN
2    55   False        1          17
2    58   False        1          14
3    01   False        1          11
3    12   True         1          0
5    0    False        1          NaN
5    5    False        1          NaN
5    15   False        1          15
5    30   True         1          0
5    50   False        1          NaN
2    0    False        2          NaN
2    10   False        2          NaN
2    30   False        2          20
2    50   True         2          0

My solution so far was:

  1. Iterate over sessionIds and select the rows of each session.
  2. Build a second dataframe df_done containing only the "True" rows from this selection.
  3. Iterate over df_done and select the preceding rows within 'sec' seconds.
  4. Iterate over these preceding rows and write the values.

Here's my code so far (the iteration over sessionId is missing because I am currently testing this for one session only):

def get_preceding(df_dataset, sec=20):
    # Rows where the event is done
    df_done = df_dataset[df_dataset['isDone']]
    minute, second = df_dataset['minute'], df_dataset['second']
    for row in df_done.itertuples():
        done_min = row.minute
        done_sec = row.second
        # Start of the look-back window (assumes sec <= 60)
        if done_sec < sec:
            pre_min = done_min - 1
            pre_sec = 60 + done_sec - sec
        else:
            pre_min = done_min
            pre_sec = done_sec - sec

        # Rows at or after the window start and at or before the done event
        after_start = (minute > pre_min) | ((minute == pre_min) & (second >= pre_sec))
        before_done = (minute < done_min) | ((minute == done_min) & (second <= done_sec))
        for r in df_dataset[after_start & before_done].itertuples():
            # itertuples yields read-only namedtuples, so write back via .loc
            df_dataset.loc[r.Index, 'secToDone'] = (done_min - r.minute) * 60 + done_sec - r.second
    return df_dataset

But this is a lot of iteration, and the dataframe is quite big. So my question is:

Is there a faster and more "pandasic" way to do this?

First, combine minutes and seconds into a single time value in seconds:

df['t'] = df['min'] * 60 + df.sec

    min  sec  isDone  sessionId    t
0     2   40   False          1  160
1     2   50   False          1  170
2     2   55   False          1  175
3     2   58   False          1  178
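
To follow along, the sample frame from the question can be rebuilt as below. This is only a sketch for reproducing the steps: the column names `min`, `sec`, `isDone`, and `sessionId` come from the simplified table in the question (the question's own code uses `minute` and `second`, so adjust as needed).

```python
import pandas as pd

# Sample data copied from the simplified table in the question.
df = pd.DataFrame({
    "min": [2, 2, 2, 2, 3, 3, 5, 5, 5, 5, 5, 2, 2, 2, 2],
    "sec": [40, 50, 55, 58, 1, 12, 0, 5, 15, 30, 50, 0, 10, 30, 50],
    "isDone": [False, False, False, False, False, True,
               False, False, False, True, False,
               False, False, False, True],
    "sessionId": [1] * 11 + [2] * 4,
})

# Absolute time in seconds, so rows can be compared across minute boundaries.
df["t"] = df["min"] * 60 + df["sec"]
```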

Then, mark all the times at which a True occurred:

df['true_t'] = df[df.isDone].t

    min  sec  isDone  sessionId    t  true_t
0     2   40   False          1  160     NaN
1     2   50   False          1  170     NaN
2     2   55   False          1  175     NaN
3     2   58   False          1  178     NaN
4     3    1   False          1  181     NaN
5     3   12    True          1  192   192.0
6     5    0   False          1  300     NaN
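
The assignment above works because pandas aligns the filtered rows back to the full index, leaving NaN everywhere else. An equivalent spelling that makes this explicit is `Series.where`; here is a minimal sketch on a small, made-up three-row frame:

```python
import pandas as pd

# Tiny hypothetical frame with just the columns this step needs.
df = pd.DataFrame({
    "t": [181, 192, 300],
    "isDone": [False, True, False],
})

# Keep t where isDone is True; all other rows become NaN.
df["true_t"] = df["t"].where(df["isDone"])
```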

Now for the magic of groupby, which back-fills the time of the next True within each session:

df['next_true_t'] = df.groupby('sessionId').true_t.bfill()

    min  sec  isDone  sessionId    t  true_t  next_true_t
0     2   40   False          1  160     NaN        192.0
1     2   50   False          1  170     NaN        192.0
2     2   55   False          1  175     NaN        192.0
3     2   58   False          1  178     NaN        192.0
4     3    1   False          1  181     NaN        192.0
5     3   12    True          1  192   192.0        192.0
6     5    0   False          1  300     NaN        330.0
7     5    5   False          1  305     NaN        330.0
8     5   15   False          1  315     NaN        330.0
9     5   30    True          1  330   330.0        330.0
10    5   50   False          1  350     NaN          NaN
11    2    0   False          2  120     NaN        170.0
12    2   10   False          2  130     NaN        170.0
13    2   30   False          2  150     NaN        170.0
14    2   50    True          2  170   170.0        170.0
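
Row 10 shows why the groupby matters: it stays NaN because session 1 has no later True. A plain back-fill would pull in the value from session 2 instead. A small sketch with made-up values illustrating the difference:

```python
import numpy as np
import pandas as pd

# Two sessions; session 1 ends with a row that has no later True.
df = pd.DataFrame({
    "sessionId": [1, 1, 1, 2, 2],
    "true_t": [np.nan, 330.0, np.nan, np.nan, 170.0],
})

# Plain bfill leaks session 2's value into the last row of session 1.
leaky = df["true_t"].bfill()

# Grouped bfill stops at the session boundary, so that row stays NaN.
grouped = df.groupby("sessionId")["true_t"].bfill()
```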

Now it's trivial to calculate the difference:

df['diff'] = df.next_true_t - df.t

    min  sec  isDone  sessionId    t  true_t  next_true_t  diff
0     2   40   False          1  160     NaN        192.0  32.0
1     2   50   False          1  170     NaN        192.0  22.0
2     2   55   False          1  175     NaN        192.0  17.0
3     2   58   False          1  178     NaN        192.0  14.0
4     3    1   False          1  181     NaN        192.0  11.0
5     3   12    True          1  192   192.0        192.0   0.0
6     5    0   False          1  300     NaN        330.0  30.0
7     5    5   False          1  305     NaN        330.0  25.0
8     5   15   False          1  315     NaN        330.0  15.0
9     5   30    True          1  330   330.0        330.0   0.0
10    5   50   False          1  350     NaN          NaN   NaN
11    2    0   False          2  120     NaN        170.0  50.0
12    2   10   False          2  130     NaN        170.0  40.0
13    2   30   False          2  150     NaN        170.0  20.0
14    2   50    True          2  170   170.0        170.0   0.0

I'll leave it to you to drop the values that exceed your limit in seconds, but that is straightforward.
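
One way to handle that last step is to mask the differences with `Series.where`. The following is a minimal sketch, not the answerer's own code: `add_sec_to_done` and its `limit` parameter are hypothetical names, and it assumes that rows are already sorted by time within each session and that the limit is inclusive, as in the question's expected output.

```python
import pandas as pd

def add_sec_to_done(df, limit=20):
    """Return a copy of df with a secToDone column (hypothetical helper)."""
    out = df.copy()
    t = out["min"] * 60 + out["sec"]
    # Time of the next True within the same session (NaN if there is none).
    next_true_t = t.where(out["isDone"]).groupby(out["sessionId"]).bfill()
    diff = next_true_t - t
    # Keep only differences within the limit; everything else becomes NaN.
    out["secToDone"] = diff.where(diff <= limit)
    return out

# Sample data copied from the simplified table in the question.
df = pd.DataFrame({
    "min": [2, 2, 2, 2, 3, 3, 5, 5, 5, 5, 5, 2, 2, 2, 2],
    "sec": [40, 50, 55, 58, 1, 12, 0, 5, 15, 30, 50, 0, 10, 30, 50],
    "isDone": [False, False, False, False, False, True,
               False, False, False, True, False,
               False, False, False, True],
    "sessionId": [1] * 11 + [2] * 4,
})

result = add_sec_to_done(df, limit=20)
```

Because the comparison `diff <= limit` is False for NaN, rows with no later True in their session stay NaN without any special handling.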
