How to flag a column of a dataframe by the values of another dataframe in the most efficient way in python/pandas?
I've got a dataframe "A" (~500k records). It contains two columns: "from_timestamp" and "to_timestamp".
I've got a dataframe "B" (~5M records). It has some values and a column named "actual_timestamp".
I want to flag every row in dataframe "B" whose "actual_timestamp" value lies between the values of any "from_timestamp" and "to_timestamp" pair (bounds included).
I want something like this, but much more efficient:
B['corrupted_flag'] = False
for index, row in A.iterrows():
    cond1 = B['actual_timestamp'] >= row['from_timestamp']
    cond2 = B['actual_timestamp'] <= row['to_timestamp']
    # .ix was removed in pandas 1.0; .loc is the label-based equivalent
    B.loc[cond1 & cond2, 'corrupted_flag'] = True
What is the fastest/most efficient way to do this in python/pandas?
Update: Sample data
dataframe A (input):
from_timestamp to_timestamp
3 4
6 9
8 10
dataframe B (input):
data actual_timestamp
a 2
b 3
c 4
d 5
e 8
f 10
g 11
h 12
dataframe B (expected output):
data actual_timestamp corrupted_flag
a 2 False
b 3 True
c 4 True
d 5 False
e 8 True
f 10 True
g 11 False
h 12 False
You can use the intervaltree package to build an interval tree from the timestamps in DataFrame A, and then check if each timestamp from DataFrame B is in the tree:
from intervaltree import IntervalTree
tree = IntervalTree.from_tuples(zip(A['from_timestamp'], A['to_timestamp'] + 0.1))
B['corrupted_flag'] = B['actual_timestamp'].map(lambda x: tree.overlaps(x))
Note that you need to pad A['to_timestamp'] slightly, as the upper bound of an interval is not included as part of the interval in the intervaltree package (although the lower bound is).
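To make the effect of the padding concrete, here is a small sketch in plain Python. It does not use intervaltree itself; the helper names `in_half_open` and `flag` are made up for illustration and only mimic the half-open `[begin, end)` behaviour described above, applied to the sample data from the question.

```python
def in_half_open(begin, end, x):
    """Membership test for a half-open interval [begin, end)."""
    return begin <= x < end


def flag(points, intervals, pad=0):
    """Flag each point that falls in any interval, after padding the upper bound."""
    return [
        any(in_half_open(begin, end + pad, x) for begin, end in intervals)
        for x in points
    ]


# Sample data from the question
intervals = [(3, 4), (6, 9), (8, 10)]      # rows of A
points = [2, 3, 4, 5, 8, 10, 11, 12]       # actual_timestamp values of B

unpadded = flag(points, intervals)         # misses the points 4 and 10
padded = flag(points, intervals, pad=0.1)  # matches the expected output
```

The pad has to be smaller than the smallest gap between two distinct timestamps. With integer timestamps 0.1 is safe, but with fractional timestamps a value such as 4.05 would be flagged even though it lies outside the original interval.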
This method improved performance by a little more than a factor of 14 on some sample data I generated (A = 10k rows, B = 100k rows). The performance boost got bigger the more rows I added.
I've used the intervaltree package with datetime objects before, so the code above should still work if your timestamps aren't integers like they are in your sample data; you just might need to change how upper bounds are padded.
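As a rough sketch of what "changing how upper bounds are padded" could look like for datetime objects: adding a float such as 0.1 to a datetime raises a TypeError, so the pad would have to be a timedelta. The names `PAD`, `pad_upper` and `contains` below are hypothetical, and the containment check is a plain-Python stand-in for the half-open behaviour of intervaltree, not the library itself.

```python
from datetime import datetime, timedelta

# Assumption: one microsecond, the smallest step a stdlib datetime can represent
PAD = timedelta(microseconds=1)


def pad_upper(intervals, pad=PAD):
    """Return the intervals with each upper bound pushed out by `pad`."""
    return [(begin, end + pad) for begin, end in intervals]


def contains(intervals, x):
    """Half-open containment check: begin <= x < end for any interval."""
    return any(begin <= x < end for begin, end in intervals)


raw = [(datetime(2020, 1, 1, 3), datetime(2020, 1, 1, 4))]
padded = pad_upper(raw)
at_end = datetime(2020, 1, 1, 4)

print(contains(raw, at_end))           # upper bound excluded without padding
print(contains(padded, at_end))        # upper bound included after padding
print(contains(padded, at_end + PAD))  # the next representable instant stays out
```

If the columns hold pandas Timestamp values rather than stdlib datetime objects, the pad would presumably be a pd.Timedelta at the resolution of the data instead.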
Based on the ideas above, my final solution is the following (it does not raise a MemoryError on big datasets):
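from intervaltree import IntervalTree
import pandas as pd

def flagDataWithGaps(A, B):
    # Work on float copies of the timestamp columns
    A['from_ts'] = A['from'].astype(float)
    A['to_ts'] = A['to'].astype(float)
    # Pad the upper bound, since intervaltree excludes it from the interval
    A['to_ts'] = A['to_ts'] + 0.1
    B['actual_ts'] = B['actual'].astype(float)
    tree = IntervalTree.from_tuples(zip(A['from_ts'], A['to_ts']))
    # Generator, so the flags are produced one at a time
    col = (tree.overlaps(x) for x in B['actual_ts'])
    df = pd.DataFrame(col)
    # .values assigns by position, so B does not need a default 0..n-1 index
    B['is_gap'] = df[0].values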
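If installing intervaltree is not an option, the same flagging can be sketched with the standard library only. This is a different technique from the answers above: overlapping intervals are merged first, then each timestamp is located with a binary search. The function names `merge_intervals` and `flag_points` are made up for this sketch, and the intervals are treated as closed on both ends, so no padding is needed.

```python
from bisect import bisect_right


def merge_intervals(intervals):
    """Merge overlapping closed intervals into a sorted, disjoint list."""
    merged = []
    for begin, end in sorted(intervals):
        if merged and begin <= merged[-1][1]:
            # Overlaps the previous interval: extend it if necessary
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([begin, end])
    return merged


def flag_points(points, intervals):
    """Return one bool per point: True if it lies inside any interval."""
    merged = merge_intervals(intervals)
    starts = [begin for begin, _ in merged]
    ends = [end for _, end in merged]
    flags = []
    for x in points:
        # Index of the last merged interval that starts at or before x
        i = bisect_right(starts, x) - 1
        flags.append(i >= 0 and x <= ends[i])
    return flags


# Sample data from the question
intervals = [(3, 4), (6, 9), (8, 10)]
points = [2, 3, 4, 5, 8, 10, 11, 12]
flags = flag_points(points, intervals)
```

With pandas, the inputs could be built with zip(A['from_timestamp'], A['to_timestamp']) and B['actual_timestamp'], and the resulting list assigned to B['corrupted_flag']. I have not benchmarked this against the intervaltree version.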