
What is the most efficient way to flag rows of one dataframe based on value ranges from another dataframe in python/pandas?

I've got a dataframe "A" (~500k records). It contains two columns: "fromTimestamp" and "toTimestamp".

I've got a dataframe "B" (~5M records). It has some values and a column named "actualTimestamp".

I want to flag every row in dataframe "B" whose "actualTimestamp" value falls between the "fromTimestamp" and "toTimestamp" values of any pair in "A".

I want something like this, but with much more efficient code:

for index, row in A.iterrows():
    cond1 = B['actual_timestamp'] >= row['from_timestamp']
    cond2 = B['actual_timestamp'] <= row['to_timestamp']
    B.loc[cond1 & cond2, 'corrupted_flag'] = True

What is the fastest/most efficient way to do this in python/pandas?

Update: Sample data

dataframe A (input):

from_timestamp    to_timestamp
3                 4             
6                 9
8                 10

dataframe B (input):

data    actual_timestamp
a       2
b       3
c       4
d       5
e       8
f       10
g       11
h       12

dataframe B (expected output):

data    actual_timestamp   corrupted_flag
a       2                  False
b       3                  True
c       4                  True
d       5                  False
e       8                  True
f       10                 True
g       11                 False
h       12                 False
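For reference, here is a self-contained sketch that reproduces the sample frames above and the naive per-interval loop (using `.loc`, since `.ix` is deprecated in modern pandas):

```python
import pandas as pd

A = pd.DataFrame({'from_timestamp': [3, 6, 8],
                  'to_timestamp':   [4, 9, 10]})
B = pd.DataFrame({'data': list('abcdefgh'),
                  'actual_timestamp': [2, 3, 4, 5, 8, 10, 11, 12]})

# Naive O(len(A) * len(B)) approach: one boolean mask per interval
B['corrupted_flag'] = False
for _, row in A.iterrows():
    in_range = (B['actual_timestamp'] >= row['from_timestamp']) & \
               (B['actual_timestamp'] <= row['to_timestamp'])
    B.loc[in_range, 'corrupted_flag'] = True

print(B['corrupted_flag'].tolist())
# [False, True, True, False, True, True, False, False]
```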

You can use the intervaltree package to build an interval tree from the timestamps in DataFrame A, and then check if each timestamp from DataFrame B is in the tree:

from intervaltree import IntervalTree

# Pad the upper bound slightly so it is effectively inclusive (see note below)
tree = IntervalTree.from_tuples(zip(A['from_timestamp'], A['to_timestamp'] + 0.1))
B['corrupted_flag'] = B['actual_timestamp'].map(lambda x: tree.overlaps(x))

Note that you need to pad A['to_timestamp'] slightly, because the intervaltree package treats intervals as half-open: the lower bound is included in the interval, but the upper bound is not.
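To see the half-open behaviour concretely (this sketch assumes the `intervaltree` package is installed):

```python
from intervaltree import IntervalTree

# Intervals in intervaltree are half-open: [begin, end)
tree = IntervalTree.from_tuples([(3, 4)])
print(tree.overlaps(3))  # True  -- lower bound included
print(tree.overlaps(4))  # False -- upper bound excluded

# With a small pad on the upper bound, 4 is covered
padded = IntervalTree.from_tuples([(3, 4 + 0.1)])
print(padded.overlaps(4))  # True
```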

This method improved performance by a bit more than a factor of 14 on some sample data I generated (A = 10k rows, B = 100k rows), and the speedup grew as I added more rows.

I've used the intervaltree package with datetime objects before, so the code above should still work if your timestamps aren't integers like they are in your sample data; you just might need to change how upper bounds are padded.
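If you'd rather avoid the extra dependency, a similar speedup is possible with plain numpy: sort the intervals, merge any that overlap, then locate each timestamp with `searchsorted`. This is a sketch of an alternative technique (not the answer's original method), assuming numeric timestamps and the column names from the sample data:

```python
import numpy as np
import pandas as pd

def flag_in_ranges(A, B):
    # Sort intervals by start and merge overlapping ones so a single
    # searchsorted lookup per timestamp is enough
    ivals = A[['from_timestamp', 'to_timestamp']] \
        .sort_values('from_timestamp').to_numpy(dtype=float)
    merged = []
    for lo, hi in ivals:
        if merged and lo <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], hi)
        else:
            merged.append([lo, hi])
    starts = np.array([m[0] for m in merged])
    ends = np.array([m[1] for m in merged])

    ts = B['actual_timestamp'].to_numpy(dtype=float)
    # Index of the last merged interval starting at or before each timestamp
    idx = np.searchsorted(starts, ts, side='right') - 1
    return (idx >= 0) & (ts <= ends[np.clip(idx, 0, None)])

A = pd.DataFrame({'from_timestamp': [3, 6, 8], 'to_timestamp': [4, 9, 10]})
B = pd.DataFrame({'data': list('abcdefgh'),
                  'actual_timestamp': [2, 3, 4, 5, 8, 10, 11, 12]})
B['corrupted_flag'] = flag_in_ranges(A, B)
print(B['corrupted_flag'].tolist())
# [False, True, True, False, True, True, False, False]
```

Both steps are vectorized over B, so only the one-off interval merge is a Python-level loop over A.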

Based on the ideas above, my final solution is the following (it does not raise a MemoryError on large datasets):

from intervaltree import IntervalTree
import pandas as pd 

def flagDataWithGaps(A, B):
    # Convert bounds to float and pad the upper bound, since
    # intervaltree treats intervals as half-open (upper bound excluded)
    A['from_ts'] = A['from'].astype(float)
    A['to_ts'] = A['to'].astype(float) + 0.1
    B['actual_ts'] = B['actual'].astype(float)

    tree = IntervalTree.from_tuples(zip(A['from_ts'], A['to_ts']))

    # Use a generator instead of .map() to avoid materializing an
    # intermediate list for a very large B
    col = (tree.overlaps(x) for x in B['actual_ts'])
    B['is_gap'] = pd.DataFrame(col)[0]
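Assuming the column names `from`, `to`, and `actual` used by the function, it can be exercised on the sample data like this (requires the `intervaltree` package):

```python
from intervaltree import IntervalTree
import pandas as pd

def flagDataWithGaps(A, B):
    # Pad the upper bound because intervaltree intervals are half-open
    A['from_ts'] = A['from'].astype(float)
    A['to_ts'] = A['to'].astype(float) + 0.1
    B['actual_ts'] = B['actual'].astype(float)
    tree = IntervalTree.from_tuples(zip(A['from_ts'], A['to_ts']))
    col = (tree.overlaps(x) for x in B['actual_ts'])
    B['is_gap'] = pd.DataFrame(col)[0]

A = pd.DataFrame({'from': [3, 6, 8], 'to': [4, 9, 10]})
B = pd.DataFrame({'actual': [2, 3, 4, 5, 8, 10, 11, 12]})
flagDataWithGaps(A, B)
print(B['is_gap'].tolist())
# [False, True, True, False, True, True, False, False]
```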
