I have a dataframe that looks like this:
In [265]: df_2
Out[265]:
A ID DATETIME ORDER_FAILED
0 B-028 b76cd912ff 2014-10-08 13:43:27 True
1 B-054 4a57ed0b02 2014-10-08 14:26:19 False
2 B-076 1a682034f8 2014-10-08 14:29:01 False
3 B-023 b76cd912ff 2014-10-08 18:39:34 True
4 B-024 f88g8d7sds 2014-10-08 18:40:18 True
5 B-025 b76cd912ff 2014-10-08 18:42:02 True
6 B-026 b76cd912ff 2014-10-08 18:42:41 False
7 B-033 b76cd912ff 2014-10-08 18:44:30 True
8 B-032 b76cd912ff 2014-10-08 18:46:00 True
9 B-037 b76cd912ff 2014-10-08 18:52:15 True
10 B-046 db959faf02 2014-10-08 18:59:59 False
11 B-053 b76cd912ff 2014-10-08 19:17:48 True
12 B-065 b76cd912ff 2014-10-08 19:21:38 False
I need to drop all repeated 'failed orders' - except for the last one - in any sequence of failed orders.
A 'sequence' is a series of failed orders that meets the following criteria:
- Placed by the same user - identified by 'ID'
- Has 'ORDER_FAILED' == True
- No two consecutive orders are more than 5 minutes apart.
I was hoping this could be done like this:
In [298]: df_2[df_2.ORDER_FAILED == True].sort_values(by='DATETIME').groupby('ID')['DATETIME'].diff().dt.total_seconds()
Out[298]:
0 NaN
3 17767.0
4 NaN
5 148.0
7 148.0
8 90.0
9 375.0
11 1533.0
Name: DATETIME, dtype: float64
and then use DataFrame.join to arrive at this:
In [302]: df_2 = df_2.join(df_tmp); df_2
Out[302]:
A ID DATETIME ORDER_FAILED diff
0 B-028 b76cd912ff 2014-10-08 13:43:27 True NaN
1 B-054 4a57ed0b02 2014-10-08 14:26:19 False NaN
2 B-076 1a682034f8 2014-10-08 14:29:01 False NaN
3 B-023 b76cd912ff 2014-10-08 18:39:34 True 17767.0
4 B-024 f88g8d7sds 2014-10-08 18:40:18 True NaN
5 B-025 b76cd912ff 2014-10-08 18:42:02 True 148.0
6 B-026 b76cd912ff 2014-10-08 18:42:41 False NaN
7 B-033 b76cd912ff 2014-10-08 18:44:30 True 148.0
8 B-032 b76cd912ff 2014-10-08 18:46:00 True 90.0
9 B-037 b76cd912ff 2014-10-08 18:52:15 True 375.0
10 B-046 db959faf02 2014-10-08 18:59:59 False NaN
11 B-053 b76cd912ff 2014-10-08 19:17:48 True 1533.0
12 B-065 b76cd912ff 2014-10-08 19:21:38 False NaN
However, this is unfortunately not correct. Order 7 should have diff == NaN, as it is the first order in a new sequence of failed orders, coming right after a successful order by the same user (that would be order 6).
I realise the way I'm calculating diff above is faulty: I haven't managed to find a way to 'reset' the counter after every successful order.
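Not part of the original question - a sketch of one way to get that 'reset': let every successful order bump a per-user counter (a helper column I'm calling SEQ here), so each run of failures forms its own group and diff() restarts at NaN at the start of each run. The frame is rebuilt from the listing above so the snippet runs on its own:

```python
import pandas as pd

# Rebuild the sample frame from the listing above so the sketch is runnable.
rows = [
    ('B-028', 'b76cd912ff', '2014-10-08 13:43:27', True),
    ('B-054', '4a57ed0b02', '2014-10-08 14:26:19', False),
    ('B-076', '1a682034f8', '2014-10-08 14:29:01', False),
    ('B-023', 'b76cd912ff', '2014-10-08 18:39:34', True),
    ('B-024', 'f88g8d7sds', '2014-10-08 18:40:18', True),
    ('B-025', 'b76cd912ff', '2014-10-08 18:42:02', True),
    ('B-026', 'b76cd912ff', '2014-10-08 18:42:41', False),
    ('B-033', 'b76cd912ff', '2014-10-08 18:44:30', True),
    ('B-032', 'b76cd912ff', '2014-10-08 18:46:00', True),
    ('B-037', 'b76cd912ff', '2014-10-08 18:52:15', True),
    ('B-046', 'db959faf02', '2014-10-08 18:59:59', False),
    ('B-053', 'b76cd912ff', '2014-10-08 19:17:48', True),
    ('B-065', 'b76cd912ff', '2014-10-08 19:21:38', False),
]
df_2 = pd.DataFrame(rows, columns=['A', 'ID', 'DATETIME', 'ORDER_FAILED'])
df_2['DATETIME'] = pd.to_datetime(df_2['DATETIME'])

df_2 = df_2.sort_values('DATETIME')
# SEQ increments on every successful order, per user, so each run of
# consecutive failures shares one (ID, SEQ) key.
df_2['SEQ'] = (~df_2['ORDER_FAILED']).groupby(df_2['ID']).cumsum()
failed = df_2[df_2['ORDER_FAILED']]
# diff() now restarts (NaN) on the first failure of every sequence.
df_2['diff'] = (failed.groupby(['ID', 'SEQ'])['DATETIME']
                      .diff().dt.total_seconds())
```

With this, order 7 gets diff == NaN as desired, because it sits in a different (ID, SEQ) group than orders 0, 3 and 5.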
The desired correct outcome would be:
In [303]: df_2
Out[303]:
A ID DATETIME ORDER_FAILED diff
0 B-028 b76cd912ff 2014-10-08 13:43:27 True NaN
1 B-054 4a57ed0b02 2014-10-08 14:26:19 False NaN
2 B-076 1a682034f8 2014-10-08 14:29:01 False NaN
3 B-023 b76cd912ff 2014-10-08 18:39:34 True 17767.0
4 B-024 f88g8d7sds 2014-10-08 18:40:18 True NaN
5 B-025 b76cd912ff 2014-10-08 18:42:02 True 148.0
6 B-026 b76cd912ff 2014-10-08 18:42:41 False NaN ## <- successful order
7 B-033 b76cd912ff 2014-10-08 18:44:30 True NaN ## <- since this is the first failed order in this sequence of failed orders
8 B-032 b76cd912ff 2014-10-08 18:46:00 True 90.0
9 B-037 b76cd912ff 2014-10-08 18:52:15 True 375.0
10 B-046 db959faf02 2014-10-08 18:59:59 False NaN
11 B-053 b76cd912ff 2014-10-08 19:17:48 True 1533.0
12 B-065 b76cd912ff 2014-10-08 19:21:38 False NaN
After this point, I would just flag the orders whose next failed order in the sequence is at most 300 seconds away, and drop them, with something like this:
>> df_2.loc[df_2['diff'] <= 300, 'remove_flag'] = 1
>> df_2['remove_flag'] = df_2.groupby('ID')['remove_flag'].shift(-1) ## <- move the flag onto the previous order in the sequence
>> df_2 = df_2[df_2.remove_flag != 1]
which means that, ultimately, the orders that should be kept or discarded are as shown below:
>> df_2
A ID DATETIME ORDER_FAILED diff
0 B-028 b76cd912ff 2014-10-08 13:43:27 True NaN ## STAYS - Failed, but gap to next failed by same user is greater than 5 minutes
1 B-054 4a57ed0b02 2014-10-08 14:26:19 False NaN ## STAYS - successful order
2 B-076 1a682034f8 2014-10-08 14:29:01 False NaN ## STAYS - successful order
3 B-023 b76cd912ff 2014-10-08 18:39:34 True 17767.0 ## DISCARD - The next failed order by the same user is only 148 seconds away (less than 5 minutes)
4 B-024 f88g8d7sds 2014-10-08 18:40:18 True NaN ## STAYS - successful order
5 B-025 b76cd912ff 2014-10-08 18:42:02 True 148.0 ## STAYS - last in this sequence of failed orders by this user
6 B-026 b76cd912ff 2014-10-08 18:42:41 False NaN ## STAYS - successful order
7 B-033 b76cd912ff 2014-10-08 18:44:30 True NaN ## DISCARD - The next failed order by the same user is only 90 seconds away (less than 5 minutes)
8 B-032 b76cd912ff 2014-10-08 18:46:00 True 90.0 ## STAYS - next failed order by the same user is more than 5 minutes away
9 B-037 b76cd912ff 2014-10-08 18:52:15 True 375.0 ## STAYS - More than 5 minutes away from previous failed order by the same user
10 B-046 db959faf02 2014-10-08 18:59:59 False NaN ## STAYS - Successful order
11 B-053 b76cd912ff 2014-10-08 19:17:48 True 1533.0 ## STAYS - too long since last failed order by this same user
12 B-065 b76cd912ff 2014-10-08 19:21:38 False NaN ## STAYS - Successful order
Any help would be greatly appreciated, thanks!
I'll start by sorting by ID and DATETIME (ascending):
df1 = df.sort_values(by=['ID', 'DATETIME'])
Now, if I understand correctly, we need to remove all orders that satisfy the conjunction of the following conditions (by "next" I understand "in the next row"):
- the order failed
- the next order failed
- the time difference between the order and the next one is at most 300 s
- (additionally) the ID is the same as the next ID (otherwise this was the user's very last order)
My idea is simple: to add appropriate columns so that each row contains all data needed to evaluate these conditions.
This one adds the "next ID" and the "next order" fields:
df1[['Next_ID','Next_ORDER_FAILED']] = df1[['ID','ORDER_FAILED']].shift(-1)
and this one is responsible for the difference in time to the next order:
df1['diff'] = -df1['DATETIME'].diff(-1).dt.total_seconds()
(the relevant differences with period=-1 will be negative, hence the minus sign).
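A quick illustration of that sign convention, using two toy timestamps taken from the listing rather than the full frame:

```python
import pandas as pd

# diff(-1) computes current - next, so gaps to the *next* timestamp come out
# negative; the leading minus turns them into positive seconds.
s = pd.Series(pd.to_datetime(['2014-10-08 18:42:02', '2014-10-08 18:44:30']))
gaps = -s.diff(-1).dt.total_seconds()
# first entry: 148.0 seconds until the next order; last entry: NaN
```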
I believe the rest is already quite straightforward.
Update: By the way, we can create a bool mask even without adding new columns to the data frame (note the element-wise &, not and):
mask = (df1['ORDER_FAILED'] == True) & (df1['ORDER_FAILED'].shift(-1) == True) & ...
UPDATE
There is no real need to sort by ID, and the overall solution is in fact somewhat cleaner when groupby() is used properly. Here's how it was done in the end, following the suggestions above.
In [478]: df_3
Out[478]:
A ID DATETIME ORDER_FAILED
0 B-028 b76cd912ff 2014-10-08 13:43:27 True
1 B-054 4a57ed0b02 2014-10-08 14:26:19 False
2 B-076 1a682034f8 2014-10-08 14:29:01 False
3 B-023 b76cd912ff 2014-10-08 18:39:34 True
4 B-024 f88g8d7sds 2014-10-08 18:40:18 True
5 B-025 b76cd912ff 2014-10-08 18:42:02 True
6 B-026 b76cd912ff 2014-10-08 18:42:41 False
7 B-033 b76cd912ff 2014-10-08 18:44:30 True
8 B-032 b76cd912ff 2014-10-08 18:46:00 True
9 B-037 b76cd912ff 2014-10-08 18:52:15 True
10 B-046 db959faf02 2014-10-08 18:59:59 False
11 B-053 b76cd912ff 2014-10-08 19:17:48 True
12 B-065 b76cd912ff 2014-10-08 19:21:38 False
In [479]: df_3['NEXT_FAILED'] = df_3.sort_values(by='DATETIME').groupby('ID')['ORDER_FAILED'].shift(-1)
In [480]: df_3['SECONDS_TO_NEXT_ORDER'] = -df_3.sort_values(by='DATETIME').groupby('ID')['DATETIME'].diff(-1).dt.total_seconds()
In [481]: condition = (df_3.NEXT_FAILED == True) & (df_3.ORDER_FAILED == True) & (df_3.SECONDS_TO_NEXT_ORDER <= 300)
In [482]: df_3[~condition].drop(['NEXT_FAILED','SECONDS_TO_NEXT_ORDER'], axis=1)
Out[482]:
A ID DATETIME ORDER_FAILED
0 B-028 b76cd912ff 2014-10-08 13:43:27 True
1 B-054 4a57ed0b02 2014-10-08 14:26:19 False
2 B-076 1a682034f8 2014-10-08 14:29:01 False
4 B-024 f88g8d7sds 2014-10-08 18:40:18 True
5 B-025 b76cd912ff 2014-10-08 18:42:02 True
6 B-026 b76cd912ff 2014-10-08 18:42:41 False
8 B-032 b76cd912ff 2014-10-08 18:46:00 True
9 B-037 b76cd912ff 2014-10-08 18:52:15 True
10 B-046 db959faf02 2014-10-08 18:59:59 False
11 B-053 b76cd912ff 2014-10-08 19:17:48 True
12 B-065 b76cd912ff 2014-10-08 19:21:38 False
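For reference, the same pipeline as a self-contained script (the frame is rebuilt from the listing above; the steps mirror In [479]-[482], with the result bound to a hypothetical name, result, instead of being displayed):

```python
import pandas as pd

# Rebuild df_3 from the listing above.
rows = [
    ('B-028', 'b76cd912ff', '2014-10-08 13:43:27', True),
    ('B-054', '4a57ed0b02', '2014-10-08 14:26:19', False),
    ('B-076', '1a682034f8', '2014-10-08 14:29:01', False),
    ('B-023', 'b76cd912ff', '2014-10-08 18:39:34', True),
    ('B-024', 'f88g8d7sds', '2014-10-08 18:40:18', True),
    ('B-025', 'b76cd912ff', '2014-10-08 18:42:02', True),
    ('B-026', 'b76cd912ff', '2014-10-08 18:42:41', False),
    ('B-033', 'b76cd912ff', '2014-10-08 18:44:30', True),
    ('B-032', 'b76cd912ff', '2014-10-08 18:46:00', True),
    ('B-037', 'b76cd912ff', '2014-10-08 18:52:15', True),
    ('B-046', 'db959faf02', '2014-10-08 18:59:59', False),
    ('B-053', 'b76cd912ff', '2014-10-08 19:17:48', True),
    ('B-065', 'b76cd912ff', '2014-10-08 19:21:38', False),
]
df_3 = pd.DataFrame(rows, columns=['A', 'ID', 'DATETIME', 'ORDER_FAILED'])
df_3['DATETIME'] = pd.to_datetime(df_3['DATETIME'])

# Per user (in time order): did the next order fail, and how far away is it?
df_3['NEXT_FAILED'] = (df_3.sort_values(by='DATETIME')
                           .groupby('ID')['ORDER_FAILED'].shift(-1))
df_3['SECONDS_TO_NEXT_ORDER'] = -(df_3.sort_values(by='DATETIME')
                                      .groupby('ID')['DATETIME']
                                      .diff(-1).dt.total_seconds())
# Drop a failed order when the same user's next order also failed
# within 300 seconds.
condition = ((df_3.NEXT_FAILED == True)
             & (df_3.ORDER_FAILED == True)
             & (df_3.SECONDS_TO_NEXT_ORDER <= 300))
result = df_3[~condition].drop(['NEXT_FAILED', 'SECONDS_TO_NEXT_ORDER'],
                               axis=1)
# Orders B-023 (index 3) and B-033 (index 7) are dropped, as desired.
```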
The correct orders - as per the OP's description - are indeed dropped!