
GroupBy - Datetime diff() combining additional criteria

I have a dataframe that looks like this:

In [265]: df_2
Out[265]: 
        A          ID            DATETIME ORDER_FAILED
0   B-028  b76cd912ff 2014-10-08 13:43:27         True
1   B-054  4a57ed0b02 2014-10-08 14:26:19        False
2   B-076  1a682034f8 2014-10-08 14:29:01        False
3   B-023  b76cd912ff 2014-10-08 18:39:34         True
4   B-024  f88g8d7sds 2014-10-08 18:40:18         True
5   B-025  b76cd912ff 2014-10-08 18:42:02         True
6   B-026  b76cd912ff 2014-10-08 18:42:41        False
7   B-033  b76cd912ff 2014-10-08 18:44:30         True
8   B-032  b76cd912ff 2014-10-08 18:46:00         True
9   B-037  b76cd912ff 2014-10-08 18:52:15         True
10  B-046  db959faf02 2014-10-08 18:59:59        False
11  B-053  b76cd912ff 2014-10-08 19:17:48         True
12  B-065  b76cd912ff 2014-10-08 19:21:38        False

I need to drop all repeat 'failed orders' - except for the last one - in any failed orders sequence.

A 'sequence' is a series of failed orders that meet the following criteria:

  1. Placed by the same user - identified by 'ID'
  2. Has 'ORDER_FAILED' == True
  3. No two consecutive orders in the sequence are more than 5 minutes apart.

I was hoping this could be done like this:

In [298]: df_2[df_2.ORDER_FAILED == True].sort_values(by='DATETIME').groupby('ID')['DATETIME'].diff().dt.total_seconds()
Out[298]: 
0         NaN
3     17767.0
4         NaN
5       148.0
7       148.0
8        90.0
9       375.0
11     1533.0
Name: DATETIME, dtype: float64

and then join that series (stored as df_tmp and renamed to 'diff') back onto the frame to reach this:

In [302]: df_2 = df_2.join(df_tmp); df_2
Out[302]: 
        A          ID            DATETIME ORDER_FAILED     diff
0   B-028  b76cd912ff 2014-10-08 13:43:27         True      NaN
1   B-054  4a57ed0b02 2014-10-08 14:26:19        False      NaN
2   B-076  1a682034f8 2014-10-08 14:29:01        False      NaN
3   B-023  b76cd912ff 2014-10-08 18:39:34         True  17767.0
4   B-024  f88g8d7sds 2014-10-08 18:40:18         True      NaN
5   B-025  b76cd912ff 2014-10-08 18:42:02         True    148.0
6   B-026  b76cd912ff 2014-10-08 18:42:41        False      NaN
7   B-033  b76cd912ff 2014-10-08 18:44:30         True    148.0
8   B-032  b76cd912ff 2014-10-08 18:46:00         True     90.0
9   B-037  b76cd912ff 2014-10-08 18:52:15         True    375.0
10  B-046  db959faf02 2014-10-08 18:59:59        False      NaN
11  B-053  b76cd912ff 2014-10-08 19:17:48         True   1533.0
12  B-065  b76cd912ff 2014-10-08 19:21:38        False      NaN

However, this is unfortunately not correct. Order 7 should have diff == NaN, as it is the first order in a series of failed orders, coming after a successful order by this user (that would be order 6).

I realise the way I'm calculating the diff above is faulty; I haven't managed to find a way to 'reset' the counter after every successful order.
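(For reference, one common trick for this kind of reset is to label each run of failed orders with a cumulative count of the successful orders seen so far, and then diff within those labels. A minimal sketch on made-up data, with the question's column names and a hypothetical user 'u1':)

```python
import pandas as pd

df = pd.DataFrame({
    'ID':           ['u1', 'u1', 'u1', 'u1'],
    'DATETIME':     pd.to_datetime(['2014-10-08 18:39:34',
                                    '2014-10-08 18:42:02',
                                    '2014-10-08 18:42:41',
                                    '2014-10-08 18:44:30']),
    'ORDER_FAILED': [True, True, False, True],
}).sort_values('DATETIME')

# Each successful order bumps the counter, starting a new sequence,
# so failed orders on either side of it land in different groups.
df['SEQ'] = (~df['ORDER_FAILED']).groupby(df['ID']).cumsum()

# diff() now restarts (NaN) at the first failed order of every sequence.
df['diff'] = (df[df['ORDER_FAILED']]
              .groupby(['ID', 'SEQ'])['DATETIME']
              .diff().dt.total_seconds())
```

Here the failed order at 18:44:30 gets diff == NaN rather than inheriting a diff from before the successful order.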

The desired correct outcome would be:

In [303]: df_2
Out[303]: 
        A          ID            DATETIME ORDER_FAILED     diff
0   B-028  b76cd912ff 2014-10-08 13:43:27         True      NaN
1   B-054  4a57ed0b02 2014-10-08 14:26:19        False      NaN
2   B-076  1a682034f8 2014-10-08 14:29:01        False      NaN
3   B-023  b76cd912ff 2014-10-08 18:39:34         True  17767.0
4   B-024  f88g8d7sds 2014-10-08 18:40:18         True      NaN
5   B-025  b76cd912ff 2014-10-08 18:42:02         True    148.0
6   B-026  b76cd912ff 2014-10-08 18:42:41        False      NaN ## <- successful order
7   B-033  b76cd912ff 2014-10-08 18:44:30         True      NaN ## <- since this is the first failed order in this sequence of failed orders
8   B-032  b76cd912ff 2014-10-08 18:46:00         True     90.0
9   B-037  b76cd912ff 2014-10-08 18:52:15         True    375.0
10  B-046  db959faf02 2014-10-08 18:59:59        False      NaN
11  B-053  b76cd912ff 2014-10-08 19:17:48         True   1533.0
12  B-065  b76cd912ff 2014-10-08 19:21:38        False      NaN

After this point, I would just mark the orders that have another failed order less than 5 minutes (300 s) after them, with something like this:

>> df_2.loc[df_2['diff'] <= 300, 'remove_flag'] = 1
>> df_2['remove_flag'] = df_2.groupby('ID')['remove_flag'].shift(-1) ## <- move the flag onto the previous order in the sequence
>> df_2 = df_2[df_2.remove_flag != 1]
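(The shift-the-flag step can be checked in isolation; a toy frame with hypothetical users 'u1' and 'u2' shows the flag landing on the order preceding the flagged one, within each ID:)

```python
import pandas as pd

df = pd.DataFrame({
    'ID':          ['u1', 'u1', 'u2', 'u1'],
    'remove_flag': [None, 1.0, None, None],
})

# shift(-1) within each ID pulls every flag one row earlier,
# i.e. onto the previous order by the same user; the 'u2' row
# in between is not affected.
df['remove_flag'] = df.groupby('ID')['remove_flag'].shift(-1)
```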

which means that, ultimately, the orders that should be kept or discarded are as shown below:

>> df_2 
        A          ID            DATETIME ORDER_FAILED     diff
0   B-028  b76cd912ff 2014-10-08 13:43:27         True      NaN ## STAYS - Failed, but gap to next failed by same user is greater than 5 minutes
1   B-054  4a57ed0b02 2014-10-08 14:26:19        False      NaN ## STAYS - successful order
2   B-076  1a682034f8 2014-10-08 14:29:01        False      NaN ## STAYS - successful order
3   B-023  b76cd912ff 2014-10-08 18:39:34         True  17767.0 ## DISCARD - The next failed order by the same user is only 148 seconds away (less than 5 minutes)
4   B-024  f88g8d7sds 2014-10-08 18:40:18         True      NaN ## STAYS - successful order
5   B-025  b76cd912ff 2014-10-08 18:42:02         True    148.0 ## STAYS - last in this sequence of failed orders by this user
6   B-026  b76cd912ff 2014-10-08 18:42:41        False      NaN ## STAYS - successful order
7   B-033  b76cd912ff 2014-10-08 18:44:30         True      NaN ## DISCARD - The next failed order by the same user is only 90 seconds away (less than 5 minutes)
8   B-032  b76cd912ff 2014-10-08 18:46:00         True     90.0 ## STAYS - next failed order by the same user is more than 5 minutes away
9   B-037  b76cd912ff 2014-10-08 18:52:15         True    375.0 ## STAYS - More than 5 minutes away from previous failed order by the same user
10  B-046  db959faf02 2014-10-08 18:59:59        False      NaN ## STAYS - Successful order
11  B-053  b76cd912ff 2014-10-08 19:17:48         True   1533.0 ## STAYS - too long since last failed order by this same user
12  B-065  b76cd912ff 2014-10-08 19:21:38        False      NaN ## STAYS - Successful order

Any help would be greatly appreciated, thanks!

I'll start with sorting by ID and DATETIME (ascending):

df1 = df.sort_values(by=['ID', 'DATETIME'])

Now, if I understand correctly, we need to remove all orders that satisfy the conjunction of the following conditions (by "next" I understand "in the next row"):

  • the order failed

  • the next order failed

  • the time difference between the order and the next one is at most 300 s

  • (and additionally) the ID is the same as the next ID (otherwise it was the very last order)

My idea is simple: to add appropriate columns so that each row contains all data needed to evaluate these conditions.

This one adds the "next ID" and the "next order" fields:

df1[['Next_ID','Next_ORDER_FAILED']] = df1[['ID','ORDER_FAILED']].shift(-1)

and this one is responsible for the difference in time to the next order:

df1['diff'] = -df1['DATETIME'].diff(-1).dt.total_seconds()

(the relevant differences with period=-1 will be negative, hence the minus sign).
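The sign flip is easy to sanity-check on a two-element series (made-up timestamps):

```python
import pandas as pd

t = pd.to_datetime(pd.Series(['2014-10-08 18:42:02',
                              '2014-10-08 18:44:30']))

# diff(-1) computes current minus *next*, so forward gaps come out
# negative; negating gives the positive seconds to the next order.
gaps = -t.diff(-1).dt.total_seconds()
```

The first element becomes 148.0 and the last one NaN, since there is no next order to compare against.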

I believe the rest is already quite straightforward.

Update: By the way, we can create a bool mask even without adding new columns to the data frame:

mask = (df1['ORDER_FAILED'] == True) & (df1['ORDER_FAILED'].shift(-1) == True) & ...

(Note the element-wise &: Python's and raises a ValueError on Series.)
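Spelling out all four conditions from the list above, on made-up data for a single hypothetical user, the mask might look like this (a sketch; df1 is assumed sorted by ID and DATETIME as before):

```python
import pandas as pd

df1 = pd.DataFrame({
    'ID':           ['u1', 'u1', 'u1'],
    'DATETIME':     pd.to_datetime(['2014-10-08 18:39:34',
                                    '2014-10-08 18:42:02',
                                    '2014-10-08 18:50:00']),
    'ORDER_FAILED': [True, True, True],
}).sort_values(['ID', 'DATETIME'])

mask = ((df1['ORDER_FAILED'] == True)                            # the order failed
        & (df1['ORDER_FAILED'].shift(-1) == True)                # the next order failed
        & (-df1['DATETIME'].diff(-1).dt.total_seconds() <= 300)  # at most 300 s apart
        & (df1['ID'] == df1['ID'].shift(-1)))                    # same user next

result = df1[~mask]
```

Only the first row is dropped: the second one is followed by a failed order more than 300 s later, and the last row has no next order at all.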

UPDATE

There is no real need to order by ID, and the overall solution would in fact be somewhat cleaner if groupby() were used properly. Here's how it was done in the end, after the suggestions above.

In [478]: df_3
Out[478]: 
        A          ID            DATETIME ORDER_FAILED
0   B-028  b76cd912ff 2014-10-08 13:43:27         True
1   B-054  4a57ed0b02 2014-10-08 14:26:19        False
2   B-076  1a682034f8 2014-10-08 14:29:01        False
3   B-023  b76cd912ff 2014-10-08 18:39:34         True
4   B-024  f88g8d7sds 2014-10-08 18:40:18         True
5   B-025  b76cd912ff 2014-10-08 18:42:02         True
6   B-026  b76cd912ff 2014-10-08 18:42:41        False
7   B-033  b76cd912ff 2014-10-08 18:44:30         True
8   B-032  b76cd912ff 2014-10-08 18:46:00         True
9   B-037  b76cd912ff 2014-10-08 18:52:15         True
10  B-046  db959faf02 2014-10-08 18:59:59        False
11  B-053  b76cd912ff 2014-10-08 19:17:48         True
12  B-065  b76cd912ff 2014-10-08 19:21:38        False

In [479]: df_3['NEXT_FAILED'] = df_3.sort_values(by='DATETIME').groupby('ID')['ORDER_FAILED'].shift(-1)

In [480]: df_3['SECONDS_TO_NEXT_ORDER'] = -df_3.sort_values(by='DATETIME').groupby('ID')['DATETIME'].diff(-1).dt.total_seconds()

In [481]: condition = (df_3.NEXT_FAILED == True) & (df_3.ORDER_FAILED == True) & (df_3.SECONDS_TO_NEXT_ORDER <= 300)

In [482]: df_3[~condition].drop(['NEXT_FAILED','SECONDS_TO_NEXT_ORDER'], axis=1)
Out[482]: 
        A          ID            DATETIME ORDER_FAILED
0   B-028  b76cd912ff 2014-10-08 13:43:27         True
1   B-054  4a57ed0b02 2014-10-08 14:26:19        False
2   B-076  1a682034f8 2014-10-08 14:29:01        False
4   B-024  f88g8d7sds 2014-10-08 18:40:18         True
5   B-025  b76cd912ff 2014-10-08 18:42:02         True
6   B-026  b76cd912ff 2014-10-08 18:42:41        False
8   B-032  b76cd912ff 2014-10-08 18:46:00         True
9   B-037  b76cd912ff 2014-10-08 18:52:15         True
10  B-046  db959faf02 2014-10-08 18:59:59        False
11  B-053  b76cd912ff 2014-10-08 19:17:48         True
12  B-065  b76cd912ff 2014-10-08 19:21:38        False

The correct orders - as per description by OP - are indeed dropped!
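For anyone who wants to replay this end to end, here is a self-contained version on a subset of the question's rows (one user, including the successful order in the middle of the failed run):

```python
import pandas as pd

df_3 = pd.DataFrame({
    'A':  ['B-028', 'B-023', 'B-025', 'B-026', 'B-033', 'B-032'],
    'ID': ['b76cd912ff'] * 6,
    'DATETIME': pd.to_datetime([
        '2014-10-08 13:43:27', '2014-10-08 18:39:34',
        '2014-10-08 18:42:02', '2014-10-08 18:42:41',
        '2014-10-08 18:44:30', '2014-10-08 18:46:00']),
    'ORDER_FAILED': [True, True, True, False, True, True],
})

# Status and time gap of the user's *next* order, whatever its outcome.
df_3['NEXT_FAILED'] = (df_3.sort_values('DATETIME')
                       .groupby('ID')['ORDER_FAILED'].shift(-1))
df_3['SECONDS_TO_NEXT_ORDER'] = (-df_3.sort_values('DATETIME')
                                  .groupby('ID')['DATETIME']
                                  .diff(-1).dt.total_seconds())

# Drop a failed order when the same user's next order also failed
# and came within 300 seconds.
condition = ((df_3.NEXT_FAILED == True)
             & (df_3.ORDER_FAILED == True)
             & (df_3.SECONDS_TO_NEXT_ORDER <= 300))
result = df_3[~condition].drop(
    ['NEXT_FAILED', 'SECONDS_TO_NEXT_ORDER'], axis=1)
```

As in the full example, B-023 and B-033 are dropped and the rest survive, including B-028 (next failure too far away) and B-032 (last in its run).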
