
GroupBy - Datetime diff() combining additional criteria

I have a dataframe that looks like this:

In [265]: df_2
Out[265]: 
        A          ID            DATETIME ORDER_FAILED
0   B-028  b76cd912ff 2014-10-08 13:43:27         True
1   B-054  4a57ed0b02 2014-10-08 14:26:19        False
2   B-076  1a682034f8 2014-10-08 14:29:01        False
3   B-023  b76cd912ff 2014-10-08 18:39:34         True
4   B-024  f88g8d7sds 2014-10-08 18:40:18         True
5   B-025  b76cd912ff 2014-10-08 18:42:02         True
6   B-026  b76cd912ff 2014-10-08 18:42:41        False
7   B-033  b76cd912ff 2014-10-08 18:44:30         True
8   B-032  b76cd912ff 2014-10-08 18:46:00         True
9   B-037  b76cd912ff 2014-10-08 18:52:15         True
10  B-046  db959faf02 2014-10-08 18:59:59        False
11  B-053  b76cd912ff 2014-10-08 19:17:48         True
12  B-065  b76cd912ff 2014-10-08 19:21:38        False
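For anyone who wants to reproduce this, the frame above can be rebuilt roughly as follows. This is only a sketch: the rows are copied from the printout, and it assumes DATETIME is a datetime64 column and ORDER_FAILED a plain bool column, which the output suggests but does not prove.

```python
import pandas as pd

# Rows copied from the printout above: (A, ID, DATETIME, ORDER_FAILED).
rows = [
    ("B-028", "b76cd912ff", "2014-10-08 13:43:27", True),
    ("B-054", "4a57ed0b02", "2014-10-08 14:26:19", False),
    ("B-076", "1a682034f8", "2014-10-08 14:29:01", False),
    ("B-023", "b76cd912ff", "2014-10-08 18:39:34", True),
    ("B-024", "f88g8d7sds", "2014-10-08 18:40:18", True),
    ("B-025", "b76cd912ff", "2014-10-08 18:42:02", True),
    ("B-026", "b76cd912ff", "2014-10-08 18:42:41", False),
    ("B-033", "b76cd912ff", "2014-10-08 18:44:30", True),
    ("B-032", "b76cd912ff", "2014-10-08 18:46:00", True),
    ("B-037", "b76cd912ff", "2014-10-08 18:52:15", True),
    ("B-046", "db959faf02", "2014-10-08 18:59:59", False),
    ("B-053", "b76cd912ff", "2014-10-08 19:17:48", True),
    ("B-065", "b76cd912ff", "2014-10-08 19:21:38", False),
]

df_2 = pd.DataFrame(rows, columns=["A", "ID", "DATETIME", "ORDER_FAILED"])
# Assumption: timestamps are parsed to datetime64 so that diff() yields timedeltas.
df_2["DATETIME"] = pd.to_datetime(df_2["DATETIME"])
```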

I need to drop every repeated failed order in any sequence of failed orders, keeping only the last one in that sequence.

A 'sequence' is a series of failed orders that meet the following criteria:

  1. Placed by the same user, identified by 'ID'
  2. Has 'ORDER_FAILED' == True
  3. Consecutive orders in the sequence are no more than 5 minutes apart.

I was hoping this could be done like this:

In [298]: df_2[df_2.ORDER_FAILED == True].sort_values(by='DATETIME').groupby('ID')['DATETIME'].diff().dt.total_seconds()
Out[298]: 
0         NaN
3     17767.0
4         NaN
5       148.0
7       148.0
8        90.0
9       375.0
11     1533.0
Name: DATETIME, dtype: float64

and then use DataFrame.join to get to this (df_tmp holds the series above, renamed to diff):

In [302]: df_2 = df_2.join(df_tmp); df_2
Out[302]: 
        A          ID            DATETIME ORDER_FAILED     diff
0   B-028  b76cd912ff 2014-10-08 13:43:27         True      NaN
1   B-054  4a57ed0b02 2014-10-08 14:26:19        False      NaN
2   B-076  1a682034f8 2014-10-08 14:29:01        False      NaN
3   B-023  b76cd912ff 2014-10-08 18:39:34         True  17767.0
4   B-024  f88g8d7sds 2014-10-08 18:40:18         True      NaN
5   B-025  b76cd912ff 2014-10-08 18:42:02         True    148.0
6   B-026  b76cd912ff 2014-10-08 18:42:41        False      NaN
7   B-033  b76cd912ff 2014-10-08 18:44:30         True    148.0
8   B-032  b76cd912ff 2014-10-08 18:46:00         True     90.0
9   B-037  b76cd912ff 2014-10-08 18:52:15         True    375.0
10  B-046  db959faf02 2014-10-08 18:59:59        False      NaN
11  B-053  b76cd912ff 2014-10-08 19:17:48         True   1533.0
12  B-065  b76cd912ff 2014-10-08 19:21:38        False      NaN

However, this is unfortunately not correct. Order 7 should have diff == NaN, because it is the first order in a new series of failed orders, coming after a successful order by this user (order 6).

I realise that the way I'm calculating the diff above is faulty: I haven't managed to find a way to 'reset' the counter after every successful order.

The desired outcome would be:

In [303]: df_2
Out[303]: 
        A          ID            DATETIME ORDER_FAILED     diff
0   B-028  b76cd912ff 2014-10-08 13:43:27         True      NaN
1   B-054  4a57ed0b02 2014-10-08 14:26:19        False      NaN
2   B-076  1a682034f8 2014-10-08 14:29:01        False      NaN
3   B-023  b76cd912ff 2014-10-08 18:39:34         True  17767.0
4   B-024  f88g8d7sds 2014-10-08 18:40:18         True      NaN
5   B-025  b76cd912ff 2014-10-08 18:42:02         True    148.0
6   B-026  b76cd912ff 2014-10-08 18:42:41        False      NaN ## <- successful order
7   B-033  b76cd912ff 2014-10-08 18:44:30         True      NaN ## <- since this is the first failed order in this sequence of failed orders
8   B-032  b76cd912ff 2014-10-08 18:46:00         True     90.0
9   B-037  b76cd912ff 2014-10-08 18:52:15         True    375.0
10  B-046  db959faf02 2014-10-08 18:59:59        False      NaN
11  B-053  b76cd912ff 2014-10-08 19:17:48         True   1533.0
12  B-065  b76cd912ff 2014-10-08 19:21:38        False      NaN
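One possible way to get the 'reset' behaviour shown in the table above is to label each user's runs of failures with a running count of that user's successful orders, and then take the diff inside each (ID, run) group. This is only a sketch: the helper name diff_within_failed_runs and the temporary RUN column are made up for illustration, and it assumes a unique index, a bool ORDER_FAILED column and a datetime64 DATETIME column.

```python
import pandas as pd


def diff_within_failed_runs(df):
    """Seconds since the previous failed order in the same run of failures.

    Hypothetical helper; returns a Series aligned with df.index.
    """
    ordered = df.sort_values("DATETIME")
    # Running count of each user's successful orders. It increases after every
    # success, so failures separated by a success get different labels.
    successes_so_far = (
        (~ordered["ORDER_FAILED"]).astype(int).groupby(ordered["ID"]).cumsum()
    )
    failed = ordered.assign(RUN=successes_so_far)[ordered["ORDER_FAILED"]]
    diff = failed.groupby(["ID", "RUN"])["DATETIME"].diff().dt.total_seconds()
    # Successful orders have no entry in `diff`; reindex fills them with NaN.
    return diff.reindex(df.index).rename("diff")
```

Applied to the frame above as df_2['diff'] = diff_within_failed_runs(df_2), this should give NaN for order 7 and leave the other values as in the earlier attempt.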

After this point, I would just mark the orders where diff <= 300 and shift that flag back to the previous order, with something like this:

>> df_2.loc[df_2['diff'] <= 300, 'remove_flag'] = 1
>> df_2['remove_flag'] = df_2.groupby('ID')['remove_flag'].shift(-1) ## <- adjust flag to mark the previous order in the sequence
>> df_2 = df_2[df_2.remove_flag != 1]

which means that, ultimately, the orders that should be kept or discarded are as shown below:

>> df_2 
        A          ID            DATETIME ORDER_FAILED     diff
0   B-028  b76cd912ff 2014-10-08 13:43:27         True      NaN ## STAYS - Failed, but gap to next failed by same user is greater than 5 minutes
1   B-054  4a57ed0b02 2014-10-08 14:26:19        False      NaN ## STAYS - successful order
2   B-076  1a682034f8 2014-10-08 14:29:01        False      NaN ## STAYS - successful order
3   B-023  b76cd912ff 2014-10-08 18:39:34         True  17767.0 ## DISCARD - The next failed order by the same user is only 148 seconds away (less than 5 minutes)
4   B-024  f88g8d7sds 2014-10-08 18:40:18         True      NaN ## STAYS - failed, but it is the only order by this user
5   B-025  b76cd912ff 2014-10-08 18:42:02         True    148.0 ## STAYS - last in this sequence of failed orders by this user
6   B-026  b76cd912ff 2014-10-08 18:42:41        False      NaN ## STAYS - successful order
7   B-033  b76cd912ff 2014-10-08 18:44:30         True      NaN ## DISCARD - The next failed order by the same user is only 90 seconds away (less than 5 minutes)
8   B-032  b76cd912ff 2014-10-08 18:46:00         True     90.0 ## STAYS - next failed order by the same user is more than 5 minutes away
9   B-037  b76cd912ff 2014-10-08 18:52:15         True    375.0 ## STAYS - More than 5 minutes away from previous failed order by the same user
10  B-046  db959faf02 2014-10-08 18:59:59        False      NaN ## STAYS - Successful order
11  B-053  b76cd912ff 2014-10-08 19:17:48         True   1533.0 ## STAYS - too long since last failed order by this same user
12  B-065  b76cd912ff 2014-10-08 19:21:38        False      NaN ## STAYS - Successful order

Any help would be greatly appreciated, thanks!

I'll start by sorting by ID and DATETIME (ascending):

df1 = df.sort_values(by = ['ID','DATETIME'])

Now, if I understand correctly, we need to remove all orders that satisfy all of the following conditions (by "next" I mean "in the next row"):

  • the order failed

  • the next order failed

  • the time difference between the order and the next one is at most 300 s

  • (additionally) the ID is the same as the next ID (otherwise this was the user's very last order)

My idea is simple: add appropriate columns so that each row contains all the data needed to evaluate these conditions.

This one adds the "next ID" and the "next order failed" fields:

df1[['Next_ID','Next_ORDER_FAILED']] = df1[['ID','ORDER_FAILED']].shift(-1)

and this one computes the time difference to the next order:

df1['diff'] = -df1['DATETIME'].diff(-1).dt.total_seconds()

(With periods=-1 each difference is the current row minus the next one, which is negative for ascending timestamps, hence the minus sign.)

I believe the rest is already quite straightforward.

Update: By the way, we can create a bool mask even without adding new columns to the data frame. Note that the conditions have to be combined with & rather than and:

mask = (df1['ORDER_FAILED'] == True) & (df1['ORDER_FAILED'].shift(-1) == True) & ...
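As a rough illustration, the four conditions listed earlier could be combined into one mask like this. It is a sketch rather than the exact code behind the "..." above: the function name drop_repeated_failures and the max_gap_seconds parameter are hypothetical, and it assumes ORDER_FAILED is a bool column and DATETIME is datetime64.

```python
import pandas as pd


def drop_repeated_failures(df, max_gap_seconds=300):
    """Drop failed orders that are followed closely by another failure.

    Hypothetical helper wrapping the sort-and-shift idea described above.
    """
    df1 = df.sort_values(by=["ID", "DATETIME"])
    # Values taken from the next row in the sorted frame.
    next_failed = df1["ORDER_FAILED"].shift(-1, fill_value=False)
    next_id = df1["ID"].shift(-1)
    # diff(-1) is current minus next, so negate it to get a positive gap.
    seconds_to_next = -df1["DATETIME"].diff(-1).dt.total_seconds()
    mask = (
        df1["ORDER_FAILED"]
        & next_failed
        & (next_id == df1["ID"])
        & (seconds_to_next <= max_gap_seconds)
    )
    # Keep everything outside the mask and restore the original row order.
    return df1[~mask].sort_index()
```

On the sample data this should drop rows 3 and 7 only. The comparison with NaN in the last row evaluates to False, so the final order is never removed.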

UPDATE

There is no real need to sort by ID, and the overall solution is in fact somewhat cleaner if groupby() is used properly. Here's how it was done in the end, following the suggestions above.

In [478]: df_3
Out[478]: 
        A          ID            DATETIME ORDER_FAILED
0   B-028  b76cd912ff 2014-10-08 13:43:27         True
1   B-054  4a57ed0b02 2014-10-08 14:26:19        False
2   B-076  1a682034f8 2014-10-08 14:29:01        False
3   B-023  b76cd912ff 2014-10-08 18:39:34         True
4   B-024  f88g8d7sds 2014-10-08 18:40:18         True
5   B-025  b76cd912ff 2014-10-08 18:42:02         True
6   B-026  b76cd912ff 2014-10-08 18:42:41        False
7   B-033  b76cd912ff 2014-10-08 18:44:30         True
8   B-032  b76cd912ff 2014-10-08 18:46:00         True
9   B-037  b76cd912ff 2014-10-08 18:52:15         True
10  B-046  db959faf02 2014-10-08 18:59:59        False
11  B-053  b76cd912ff 2014-10-08 19:17:48         True
12  B-065  b76cd912ff 2014-10-08 19:21:38        False

In [479]: df_3['NEXT_FAILED'] = df_3.sort_values(by='DATETIME').groupby('ID')['ORDER_FAILED'].shift(-1)

In [480]: df_3['SECONDS_TO_NEXT_ORDER'] = -df_3.sort_values(by='DATETIME').groupby('ID')['DATETIME'].diff(-1).dt.total_seconds()

In [481]: condition = (df_3.NEXT_FAILED == True) & (df_3.ORDER_FAILED == True) & (df_3.SECONDS_TO_NEXT_ORDER <= 300)

In [482]: df_3[~condition].drop(['NEXT_FAILED','SECONDS_TO_NEXT_ORDER'], axis=1)
Out[482]: 
        A          ID            DATETIME ORDER_FAILED
0   B-028  b76cd912ff 2014-10-08 13:43:27         True
1   B-054  4a57ed0b02 2014-10-08 14:26:19        False
2   B-076  1a682034f8 2014-10-08 14:29:01        False
4   B-024  f88g8d7sds 2014-10-08 18:40:18         True
5   B-025  b76cd912ff 2014-10-08 18:42:02         True
6   B-026  b76cd912ff 2014-10-08 18:42:41        False
8   B-032  b76cd912ff 2014-10-08 18:46:00         True
9   B-037  b76cd912ff 2014-10-08 18:52:15         True
10  B-046  db959faf02 2014-10-08 18:59:59        False
11  B-053  b76cd912ff 2014-10-08 19:17:48         True
12  B-065  b76cd912ff 2014-10-08 19:21:38        False

The correct orders, as described by the OP, are indeed dropped!
