GroupBy - Datetime diff() combining additional criteria
I have a dataframe that looks like this:
In [265]: df_2
Out[265]:
A ID DATETIME ORDER_FAILED
0 B-028 b76cd912ff 2014-10-08 13:43:27 True
1 B-054 4a57ed0b02 2014-10-08 14:26:19 False
2 B-076 1a682034f8 2014-10-08 14:29:01 False
3 B-023 b76cd912ff 2014-10-08 18:39:34 True
4 B-024 f88g8d7sds 2014-10-08 18:40:18 True
5 B-025 b76cd912ff 2014-10-08 18:42:02 True
6 B-026 b76cd912ff 2014-10-08 18:42:41 False
7 B-033 b76cd912ff 2014-10-08 18:44:30 True
8 B-032 b76cd912ff 2014-10-08 18:46:00 True
9 B-037 b76cd912ff 2014-10-08 18:52:15 True
10 B-046 db959faf02 2014-10-08 18:59:59 False
11 B-053 b76cd912ff 2014-10-08 19:17:48 True
12 B-065 b76cd912ff 2014-10-08 19:21:38 False
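(For anyone who wants to follow along, the sample frame above can be rebuilt from the values shown in the output:)

```python
import pandas as pd

# Rebuild the sample frame shown above, row for row.
rows = [
    ("B-028", "b76cd912ff", "2014-10-08 13:43:27", True),
    ("B-054", "4a57ed0b02", "2014-10-08 14:26:19", False),
    ("B-076", "1a682034f8", "2014-10-08 14:29:01", False),
    ("B-023", "b76cd912ff", "2014-10-08 18:39:34", True),
    ("B-024", "f88g8d7sds", "2014-10-08 18:40:18", True),
    ("B-025", "b76cd912ff", "2014-10-08 18:42:02", True),
    ("B-026", "b76cd912ff", "2014-10-08 18:42:41", False),
    ("B-033", "b76cd912ff", "2014-10-08 18:44:30", True),
    ("B-032", "b76cd912ff", "2014-10-08 18:46:00", True),
    ("B-037", "b76cd912ff", "2014-10-08 18:52:15", True),
    ("B-046", "db959faf02", "2014-10-08 18:59:59", False),
    ("B-053", "b76cd912ff", "2014-10-08 19:17:48", True),
    ("B-065", "b76cd912ff", "2014-10-08 19:21:38", False),
]
df_2 = pd.DataFrame(rows, columns=["A", "ID", "DATETIME", "ORDER_FAILED"])
# Parse the timestamp strings so datetime arithmetic (diff) works later.
df_2["DATETIME"] = pd.to_datetime(df_2["DATETIME"])
```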
I need to drop all repeat 'failed orders' - except for the last one - in any failed-orders sequence.
A 'sequence' is a series of failed orders that meet the following criteria:
- Placed by the same user - identified by 'ID'
- Has 'ORDER_FAILED' == True
- No two consecutive orders are more than 5 minutes apart.
I was hoping this could be done like this:
In [298]: df_2[df_2.ORDER_FAILED == True].sort_values(by='DATETIME').groupby('ID')['DATETIME'].diff().dt.total_seconds()
Out[298]:
0 NaN
3 17767.0
4 NaN
5 148.0
7 148.0
8 90.0
9 375.0
11 1533.0
Name: DATETIME, dtype: float64
and then use DataFrame.join (with the result above stored as df_tmp) to get to this:
In [302]: df_2 = df_2.join(df_tmp); df_2
Out[302]:
A ID DATETIME ORDER_FAILED diff
0 B-028 b76cd912ff 2014-10-08 13:43:27 True NaN
1 B-054 4a57ed0b02 2014-10-08 14:26:19 False NaN
2 B-076 1a682034f8 2014-10-08 14:29:01 False NaN
3 B-023 b76cd912ff 2014-10-08 18:39:34 True 17767.0
4 B-024 f88g8d7sds 2014-10-08 18:40:18 True NaN
5 B-025 b76cd912ff 2014-10-08 18:42:02 True 148.0
6 B-026 b76cd912ff 2014-10-08 18:42:41 False NaN
7 B-033 b76cd912ff 2014-10-08 18:44:30 True 148.0
8 B-032 b76cd912ff 2014-10-08 18:46:00 True 90.0
9 B-037 b76cd912ff 2014-10-08 18:52:15 True 375.0
10 B-046 db959faf02 2014-10-08 18:59:59 False NaN
11 B-053 b76cd912ff 2014-10-08 19:17:48 True 1533.0
12 B-065 b76cd912ff 2014-10-08 19:21:38 False NaN
However, this is unfortunately not correct. Order 7 should have diff == NaN, as it is the first order in a new sequence of failed orders, coming right after a successful order by this user (that would be order 6).
I realise the way I'm calculating diff above is faulty; I haven't managed to find a way to 'reset' the counter after every successful order.
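One common way to get that 'reset' (a sketch only, not the route taken in the accepted approach below): the running count of successful orders per user is constant within a run of failures and increases at every success, so it can serve as a sequence id to group on. Illustrated here on just the orders of user b76cd912ff, copied from the sample data:

```python
import pandas as pd

# All orders by user b76cd912ff, with their original row labels.
rows = [
    (0,  "2014-10-08 13:43:27", True),
    (3,  "2014-10-08 18:39:34", True),
    (5,  "2014-10-08 18:42:02", True),
    (6,  "2014-10-08 18:42:41", False),
    (7,  "2014-10-08 18:44:30", True),
    (8,  "2014-10-08 18:46:00", True),
    (9,  "2014-10-08 18:52:15", True),
    (11, "2014-10-08 19:17:48", True),
    (12, "2014-10-08 19:21:38", False),
]
df = pd.DataFrame(rows, columns=["idx", "DATETIME", "ORDER_FAILED"]).set_index("idx")
df["DATETIME"] = pd.to_datetime(df["DATETIME"])
df = df.sort_values("DATETIME")

# A successful order ends a sequence, so the cumulative count of successes
# is a sequence id: it stays constant within each run of failed orders.
df["SEQ"] = (~df["ORDER_FAILED"]).astype(int).cumsum()

# diff() computed within each sequence restarts (NaN) at every new run.
df["diff"] = (df[df["ORDER_FAILED"]]
              .groupby("SEQ")["DATETIME"]
              .diff().dt.total_seconds())
```

With the full frame you would compute SEQ per user, e.g. `(~df['ORDER_FAILED']).astype(int).groupby(df['ID']).cumsum()`, and then group by ['ID', 'SEQ']. Order 7 now gets diff == NaN, as desired.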
The desired correct outcome would be:
In [303]: df_2
Out[303]:
A ID DATETIME ORDER_FAILED diff
0 B-028 b76cd912ff 2014-10-08 13:43:27 True NaN
1 B-054 4a57ed0b02 2014-10-08 14:26:19 False NaN
2 B-076 1a682034f8 2014-10-08 14:29:01 False NaN
3 B-023 b76cd912ff 2014-10-08 18:39:34 True 17767.0
4 B-024 f88g8d7sds 2014-10-08 18:40:18 True NaN
5 B-025 b76cd912ff 2014-10-08 18:42:02 True 148.0
6 B-026 b76cd912ff 2014-10-08 18:42:41 False NaN ## <- successful order
7 B-033 b76cd912ff 2014-10-08 18:44:30 True NaN ## <- since this is the first failed order in this sequence of failed orders
8 B-032 b76cd912ff 2014-10-08 18:46:00 True 90.0
9 B-037 b76cd912ff 2014-10-08 18:52:15 True 375.0
10 B-046 db959faf02 2014-10-08 18:59:59 False NaN
11 B-053 b76cd912ff 2014-10-08 19:17:48 True 1533.0
12 B-065 b76cd912ff 2014-10-08 19:21:38 False NaN
After this point, I would just mark the orders where diff > 300 with something like this:
>> df_2.loc[df_2['diff'] > 300, 'remove_flag'] = 1
>> df_2['remove_flag'] = df_2.groupby('ID')['remove_flag'].shift(-1) ## <- shift the flag to mark the previous order in the sequence
>> df_2 = df_2[df_2.remove_flag != 1]
which means that, ultimately, the orders that should be kept or discarded are as shown below:
>> df_2
A ID DATETIME ORDER_FAILED diff
0 B-028 b76cd912ff 2014-10-08 13:43:27 True NaN ## STAYS - Failed, but gap to next failed by same user is greater than 5 minutes
1 B-054 4a57ed0b02 2014-10-08 14:26:19 False NaN ## STAYS - successful order
2 B-076 1a682034f8 2014-10-08 14:29:01 False NaN ## STAYS - successful order
3 B-023 b76cd912ff 2014-10-08 18:39:34 True 17767.0 ## DISCARD - The next failed order by the same user is only 148 seconds away (less than 5 minutes)
4 B-024 f88g8d7sds 2014-10-08 18:40:18 True NaN ## STAYS - successful order
5 B-025 b76cd912ff 2014-10-08 18:42:02 True 148.0 ## STAYS - last in this sequence of failed orders by this user
6 B-026 b76cd912ff 2014-10-08 18:42:41 False NaN ## STAYS - successful order
7 B-033 b76cd912ff 2014-10-08 18:44:30 True NaN ## DISCARD - The next failed order by the same user is only 90 seconds away (less than 5 minutes)
8 B-032 b76cd912ff 2014-10-08 18:46:00 True 90.0 ## STAYS - next failed order by the same user is more than 5 minutes away
9 B-037 b76cd912ff 2014-10-08 18:52:15 True 375.0 ## STAYS - More than 5 minutes away from previous failed order by the same user
10 B-046 db959faf02 2014-10-08 18:59:59 False NaN ## STAYS - Successful order
11 B-053 b76cd912ff 2014-10-08 19:17:48 True 1533.0 ## STAYS - too long since last failed order by this same user
12 B-065 b76cd912ff 2014-10-08 19:21:38 False NaN ## STAYS - Successful order
Any help would be greatly appreciated, thanks!
I'll start with sorting by ID and DATETIME (ascending):
df1 = df.sort_values(by = ['ID','DATETIME'])
Now, if I understand correctly, we need to remove all orders that satisfy the conjunction of the following conditions (by "next" I mean "in the next row"):
- the order failed
- the next order failed
- the time difference between the order and the next one is at most 300 s
- (and additionally) the ID is the same as the next ID (otherwise it was that user's very last order)
My idea is simple: add appropriate columns so that each row contains all the data needed to evaluate these conditions.
This one adds the "next ID" and the "next ORDER_FAILED" fields:
df1[['Next_ID','Next_ORDER_FAILED']] = df1[['ID','ORDER_FAILED']].shift(-1)
and this one is responsible for the time difference to the next order:
df1['diff'] = -df1['DATETIME'].diff(-1).dt.total_seconds()
(the relevant differences with period=-1 will be negative, hence the minus sign).
I believe the rest is already quite straightforward.
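(Spelling out that remaining step as a sketch, with the sample data repeated so the snippet is self-contained: the removal mask is just the conjunction of the four conditions listed above, evaluated on the helper columns.)

```python
import pandas as pd

rows = [
    ("B-028", "b76cd912ff", "2014-10-08 13:43:27", True),
    ("B-054", "4a57ed0b02", "2014-10-08 14:26:19", False),
    ("B-076", "1a682034f8", "2014-10-08 14:29:01", False),
    ("B-023", "b76cd912ff", "2014-10-08 18:39:34", True),
    ("B-024", "f88g8d7sds", "2014-10-08 18:40:18", True),
    ("B-025", "b76cd912ff", "2014-10-08 18:42:02", True),
    ("B-026", "b76cd912ff", "2014-10-08 18:42:41", False),
    ("B-033", "b76cd912ff", "2014-10-08 18:44:30", True),
    ("B-032", "b76cd912ff", "2014-10-08 18:46:00", True),
    ("B-037", "b76cd912ff", "2014-10-08 18:52:15", True),
    ("B-046", "db959faf02", "2014-10-08 18:59:59", False),
    ("B-053", "b76cd912ff", "2014-10-08 19:17:48", True),
    ("B-065", "b76cd912ff", "2014-10-08 19:21:38", False),
]
df = pd.DataFrame(rows, columns=["A", "ID", "DATETIME", "ORDER_FAILED"])
df["DATETIME"] = pd.to_datetime(df["DATETIME"])

# The helper columns from the answer above.
df1 = df.sort_values(by=["ID", "DATETIME"])
df1[["Next_ID", "Next_ORDER_FAILED"]] = df1[["ID", "ORDER_FAILED"]].shift(-1)
df1["diff"] = -df1["DATETIME"].diff(-1).dt.total_seconds()

# An order is removed only when all four conditions hold at once.
remove = ((df1["ORDER_FAILED"] == True)
          & (df1["Next_ORDER_FAILED"] == True)
          & (df1["Next_ID"] == df1["ID"])
          & (df1["diff"] <= 300))
result = df1[~remove].sort_index()[["A", "ID", "DATETIME", "ORDER_FAILED"]]
```

On the sample data this drops exactly orders 3 and 7, matching the desired outcome.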
Update: By the way, we can create a boolean mask even without adding new columns to the data frame:
mask = (df1['ORDER_FAILED'] == True) & (df1['ORDER_FAILED'].shift(-1) == True) & ...
UPDATE
There is no real need to sort by ID, and the overall solution is in fact somewhat cleaner if groupby() is used properly. Here's how it was done in the end, following the suggestions above.
In [478]: df_3
Out[478]:
A ID DATETIME ORDER_FAILED
0 B-028 b76cd912ff 2014-10-08 13:43:27 True
1 B-054 4a57ed0b02 2014-10-08 14:26:19 False
2 B-076 1a682034f8 2014-10-08 14:29:01 False
3 B-023 b76cd912ff 2014-10-08 18:39:34 True
4 B-024 f88g8d7sds 2014-10-08 18:40:18 True
5 B-025 b76cd912ff 2014-10-08 18:42:02 True
6 B-026 b76cd912ff 2014-10-08 18:42:41 False
7 B-033 b76cd912ff 2014-10-08 18:44:30 True
8 B-032 b76cd912ff 2014-10-08 18:46:00 True
9 B-037 b76cd912ff 2014-10-08 18:52:15 True
10 B-046 db959faf02 2014-10-08 18:59:59 False
11 B-053 b76cd912ff 2014-10-08 19:17:48 True
12 B-065 b76cd912ff 2014-10-08 19:21:38 False
In [479]: df_3['NEXT_FAILED'] = df_3.sort_values(by='DATETIME').groupby('ID')['ORDER_FAILED'].shift(-1)
In [480]: df_3['SECONDS_TO_NEXT_ORDER'] = -df_3.sort_values(by='DATETIME').groupby('ID')['DATETIME'].diff(-1).dt.total_seconds()
In [481]: condition = (df_3.NEXT_FAILED == True) & (df_3.ORDER_FAILED == True) & (df_3.SECONDS_TO_NEXT_ORDER <= 300)
In [482]: df_3[~condition].drop(['NEXT_FAILED','SECONDS_TO_NEXT_ORDER'], axis=1)
Out[482]:
A ID DATETIME ORDER_FAILED
0 B-028 b76cd912ff 2014-10-08 13:43:27 True
1 B-054 4a57ed0b02 2014-10-08 14:26:19 False
2 B-076 1a682034f8 2014-10-08 14:29:01 False
4 B-024 f88g8d7sds 2014-10-08 18:40:18 True
5 B-025 b76cd912ff 2014-10-08 18:42:02 True
6 B-026 b76cd912ff 2014-10-08 18:42:41 False
8 B-032 b76cd912ff 2014-10-08 18:46:00 True
9 B-037 b76cd912ff 2014-10-08 18:52:15 True
10 B-046 db959faf02 2014-10-08 18:59:59 False
11 B-053 b76cd912ff 2014-10-08 19:17:48 True
12 B-065 b76cd912ff 2014-10-08 19:21:38 False
The correct orders - as per the OP's description - are indeed dropped!
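For completeness, the final version can be run end-to-end (sample data repeated so the snippet is self-contained); it reproduces Out[482] above:

```python
import pandas as pd

rows = [
    ("B-028", "b76cd912ff", "2014-10-08 13:43:27", True),
    ("B-054", "4a57ed0b02", "2014-10-08 14:26:19", False),
    ("B-076", "1a682034f8", "2014-10-08 14:29:01", False),
    ("B-023", "b76cd912ff", "2014-10-08 18:39:34", True),
    ("B-024", "f88g8d7sds", "2014-10-08 18:40:18", True),
    ("B-025", "b76cd912ff", "2014-10-08 18:42:02", True),
    ("B-026", "b76cd912ff", "2014-10-08 18:42:41", False),
    ("B-033", "b76cd912ff", "2014-10-08 18:44:30", True),
    ("B-032", "b76cd912ff", "2014-10-08 18:46:00", True),
    ("B-037", "b76cd912ff", "2014-10-08 18:52:15", True),
    ("B-046", "db959faf02", "2014-10-08 18:59:59", False),
    ("B-053", "b76cd912ff", "2014-10-08 19:17:48", True),
    ("B-065", "b76cd912ff", "2014-10-08 19:21:38", False),
]
df_3 = pd.DataFrame(rows, columns=["A", "ID", "DATETIME", "ORDER_FAILED"])
df_3["DATETIME"] = pd.to_datetime(df_3["DATETIME"])

# Look at the *next* order by the same user, in chronological order.
s = df_3.sort_values(by="DATETIME")
df_3["NEXT_FAILED"] = s.groupby("ID")["ORDER_FAILED"].shift(-1)
df_3["SECONDS_TO_NEXT_ORDER"] = -s.groupby("ID")["DATETIME"].diff(-1).dt.total_seconds()

# Drop a failed order when the same user's next order also failed
# and came within 5 minutes.
condition = ((df_3.NEXT_FAILED == True)
             & (df_3.ORDER_FAILED == True)
             & (df_3.SECONDS_TO_NEXT_ORDER <= 300))
result = df_3[~condition].drop(["NEXT_FAILED", "SECONDS_TO_NEXT_ORDER"], axis=1)
```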