I've got a dataframe "A" (~500k records). It contains two columns: "from_timestamp" and "to_timestamp".
I've got a dataframe "B" (~5M records). It has some values and a column named "actual_timestamp".
I want to flag every row in dataframe "B" whose "actual_timestamp" falls between the "from_timestamp" and "to_timestamp" of any row in "A" (bounds included).
I want something like this, but much more efficient:
B['corrupted_flag'] = False  # start with nothing flagged
for index, row in A.iterrows():
    cond1 = B['actual_timestamp'] >= row['from_timestamp']
    cond2 = B['actual_timestamp'] <= row['to_timestamp']
    B.loc[cond1 & cond2, 'corrupted_flag'] = True  # .ix was removed in pandas 1.0
What is the fastest/most efficient way to do this in python/pandas?
Update: Sample data
dataframe A (input):
from_timestamp  to_timestamp
3               4
6               9
8               10
dataframe B (input):
data  actual_timestamp
a     2
b     3
c     4
d     5
e     8
f     10
g     11
h     12
dataframe B (expected output):
data  actual_timestamp  corrupted_flag
a     2                 False
b     3                 True
c     4                 True
d     5                 False
e     8                 True
f     10                True
g     11                False
h     12                False
You can use the intervaltree package to build an interval tree from the timestamps in DataFrame A, and then check whether each timestamp from DataFrame B falls inside the tree:
from intervaltree import IntervalTree
tree = IntervalTree.from_tuples(zip(A['from_timestamp'], A['to_timestamp'] + 0.1))
B['corrupted_flag'] = B['actual_timestamp'].map(lambda x: tree.overlaps(x))
Note that you need to pad A['to_timestamp'] slightly, because the intervaltree package does not include the upper bound of an interval as part of the interval (although it does include the lower bound).
This method improved performance by a little more than a factor of 14 on some sample data I generated (A = 10k rows, B = 100k rows). The performance boost got bigger the more rows I added.
I've used the intervaltree package with datetime objects before, so the code above should still work if your timestamps aren't integers like they are in your sample data; you might just need to change how the upper bounds are padded.
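For datetime bounds, one way to pad is to add a small timedelta to each upper bound. Here is a minimal sketch, assuming microsecond resolution is fine enough for your data. The helper names pad_upper_bounds and in_half_open and the sample datetimes are made up for illustration. The sketch only prepares the tuples and mimics the half-open rule (begin <= x < end) in plain Python; the padded tuples could then be passed to IntervalTree.from_tuples as in the code above.

```python
from datetime import datetime, timedelta

# Hypothetical helper: widen each closed interval [begin, end] by `pad`
# so that a half-open check (begin <= x < end) still includes `end`.
def pad_upper_bounds(intervals, pad=timedelta(microseconds=1)):
    return [(begin, end + pad) for begin, end in intervals]

# Hypothetical helper mimicking the half-open inclusion rule:
# the lower bound is included, the upper bound is not.
def in_half_open(x, begin, end):
    return begin <= x < end

# Made-up sample interval covering one hour
intervals = [(datetime(2020, 1, 1, 0, 0), datetime(2020, 1, 1, 1, 0))]
padded = pad_upper_bounds(intervals)

upper = datetime(2020, 1, 1, 1, 0)
print(in_half_open(upper, *intervals[0]))  # False: unpadded upper bound is excluded
print(in_half_open(upper, *padded[0]))     # True: padding brings it back in
```

The pad should be smaller than the smallest gap you care about between timestamps; otherwise a timestamp just past the real upper bound would be flagged as well.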
Based on the ideas above, my final solution is the following (it does not raise a MemoryError on big datasets):
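from intervaltree import IntervalTree
import pandas as pd

def flagDataWithGaps(A, B):
    # Convert the bounds to floats and pad the upper bound, because
    # intervaltree treats the upper bound of an interval as exclusive
    A['from_ts'] = A['from'].astype(float)
    A['to_ts'] = A['to'].astype(float) + 0.1
    B['actual_ts'] = B['actual'].astype(float)
    tree = IntervalTree.from_tuples(zip(A['from_ts'], A['to_ts']))
    # Generator that yields one flag per timestamp in B
    col = (tree.overlaps(x) for x in B['actual_ts'])
    df = pd.DataFrame(col)
    # Assign by position so a non-default index on B does not misalign the flags
    B['is_gap'] = df[0].values
    return B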
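Another option that avoids the extra dependency (this is not the approach used in the answers above, just an alternative sketch): merge the overlapping intervals first, then binary-search each timestamp with the standard-library bisect module. The function names merge_intervals and flag_points are made up for illustration, and the sketch treats both bounds as inclusive, so no padding is needed. The sample values are the ones from the question.

```python
from bisect import bisect_right

def merge_intervals(intervals):
    """Merge overlapping closed intervals into a sorted, disjoint list."""
    merged = []
    for begin, end in sorted(intervals):
        if merged and begin <= merged[-1][1]:
            # Overlaps the previous interval: extend it if needed
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([begin, end])
    return merged

def flag_points(intervals, points):
    """Return one bool per point: True if it lies inside any interval."""
    merged = merge_intervals(intervals)
    starts = [begin for begin, _ in merged]
    ends = [end for _, end in merged]
    flags = []
    for x in points:
        # Index of the last merged interval that starts at or before x
        i = bisect_right(starts, x) - 1
        flags.append(i >= 0 and x <= ends[i])
    return flags

# Sample data from the question
intervals = [(3, 4), (6, 9), (8, 10)]
points = [2, 3, 4, 5, 8, 10, 11, 12]
flags = flag_points(intervals, points)
print(flags)  # [False, True, True, False, True, True, False, False]
```

With the dataframes from the question, something like B['corrupted_flag'] = flag_points(zip(A['from_timestamp'], A['to_timestamp']), B['actual_timestamp']) should produce the expected column. I have not benchmarked it against the interval tree version.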