What is the most efficient way to flag rows of one dataframe based on the values of another dataframe in python / pandas?

Question

I've got a dataframe "A" (~500k records). It contains two columns: "fromTimestamp" and "toTimestamp".

I've got a dataframe "B" (~5M records). It has some values and a column named "actualTimestamp".

I want to flag every row in dataframe "B" whose "actualTimestamp" value falls between the "fromTimestamp" and "toTimestamp" values of any row in dataframe "A".

I want something like this, but much more efficient:

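# Start with every row unflagged, then flag the rows that fall in any interval of A
B['corrupted_flag'] = False
for index, row in A.iterrows():
    cond1 = B['actual_timestamp'] >= row['from_timestamp']
    cond2 = B['actual_timestamp'] <= row['to_timestamp']
    # .loc replaces .ix, which has been removed from pandas
    B.loc[cond1 & cond2, 'corrupted_flag'] = True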

What is the fastest/most efficient way to do this in python/pandas?

Update: Sample data

dataframe A (input):

from_timestamp    to_timestamp
3                 4             
6                 9
8                 10

dataframe B (input):

data    actual_timestamp
a       2
b       3
c       4
d       5
e       8
f       10
g       11
h       12

dataframe B (expected output):

data    actual_timestamp   corrupted_flag
a       2                  False
b       3                  True
c       4                  True
d       5                  False
e       8                  True
f       10                 True
g       11                 False
h       12                 False
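
For anyone who wants to try this out, here is a minimal sketch that rebuilds the sample data above and runs the slow row-by-row baseline on it, written with .loc and Series.between (inclusive on both ends by default). The DataFrame literals are typed in from the tables above; nothing else is assumed.

```python
import pandas as pd

# Sample inputs copied from the tables in the question
A = pd.DataFrame({"from_timestamp": [3, 6, 8], "to_timestamp": [4, 9, 10]})
B = pd.DataFrame({
    "data": list("abcdefgh"),
    "actual_timestamp": [2, 3, 4, 5, 8, 10, 11, 12],
})

# Baseline: one boolean mask over all of B per row of A (slow for large A)
B["corrupted_flag"] = False
for _, row in A.iterrows():
    in_range = B["actual_timestamp"].between(row["from_timestamp"], row["to_timestamp"])
    B.loc[in_range, "corrupted_flag"] = True

print(B)
```

Any faster approach can be checked against the corrupted_flag column this produces.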

Answer 1

You can use the intervaltree package to build an interval tree from the timestamps in DataFrame A, and then check whether each timestamp from DataFrame B falls inside the tree:

from intervaltree import IntervalTree

tree = IntervalTree.from_tuples(zip(A['from_timestamp'], A['to_timestamp'] + 0.1))
B['corrupted_flag'] = B['actual_timestamp'].map(lambda x: tree.overlaps(x))

Note that you need to pad A['to_timestamp'] slightly, because the intervaltree package treats the upper bound of an interval as excluded (the lower bound is included).
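
A small sketch of why the padding matters, using a single made-up interval (3, 4). It also shows a side effect worth keeping in mind: a pad of 0.1 is only safe when the timestamps are integers, since a non-integer value just above the real upper bound would be flagged as well.

```python
from intervaltree import IntervalTree

# One interval, no padding: begin is included, end is excluded
tree = IntervalTree.from_tuples([(3, 4)])
lower_included = tree.overlaps(3)
upper_included = tree.overlaps(4)

# Same interval with the upper bound padded by 0.1
padded = IntervalTree.from_tuples([(3, 4 + 0.1)])
upper_after_padding = padded.overlaps(4)

# Side effect of the pad: 4.05 lies outside [3, 4] but inside [3, 4.1)
spurious_hit = padded.overlaps(4.05)

print(lower_included, upper_included, upper_after_padding, spurious_hit)
```

If the timestamps can be fractional, the pad should be smaller than the smallest possible gap between two distinct timestamp values.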

This method improved performance by a little more than a factor of 14 on some sample data I generated (A = 10k rows, B = 100k rows). The speed-up grew as I added more rows.

I've used the intervaltree package with datetime objects before, so the code above should still work if your timestamps aren't integers like they are in your sample data; you might just need to change how the upper bounds are padded.
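
As a rough sketch of what that could look like with pandas datetimes: the dates below are invented for illustration, and the pad of one nanosecond assumes the columns use pandas' nanosecond resolution. The point query goes through overlaps_point, which takes the value directly and so avoids any ambiguity about how overlaps interprets a non-numeric single argument.

```python
import pandas as pd
from intervaltree import IntervalTree

# Hypothetical datetime data, made up for this example
A = pd.DataFrame({
    "from_timestamp": pd.to_datetime(["2016-05-01 00:00:03", "2016-05-01 00:00:06"]),
    "to_timestamp": pd.to_datetime(["2016-05-01 00:00:04", "2016-05-01 00:00:09"]),
})
B = pd.DataFrame({
    "actual_timestamp": pd.to_datetime([
        "2016-05-01 00:00:02",
        "2016-05-01 00:00:04",
        "2016-05-01 00:00:05",
        "2016-05-01 00:00:09",
    ]),
})

# Pad the excluded upper bound by the smallest step (assumes ns resolution)
pad = pd.Timedelta(1, unit="ns")
tree = IntervalTree.from_tuples(zip(A["from_timestamp"], A["to_timestamp"] + pad))

# One point query per row of B
B["corrupted_flag"] = [tree.overlaps_point(ts) for ts in B["actual_timestamp"]]
print(B)
```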

Answer 2

Building on the ideas above, my final solution is the following (it does not raise a MemoryError on big datasets):

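from intervaltree import IntervalTree
import pandas as pd

def flagDataWithGaps(A, B):
    # Cast to float so the upper bound can be padded by 0.1
    A['from_ts'] = A['from'].astype(float)
    A['to_ts'] = A['to'].astype(float) + 0.1  # intervaltree excludes the upper bound
    B['actual_ts'] = B['actual'].astype(float)

    tree = IntervalTree.from_tuples(zip(A['from_ts'], A['to_ts']))
    # Generator: one membership test per row of B
    col = (tree.overlaps(x) for x in B['actual_ts'])

    df = pd.DataFrame(col)
    # Assign by position so a non-default index on B cannot misalign the flags
    B['is_gap'] = df[0].values
    return B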

2 answers

Answer 1: 3, accepted, 2016-05-17 23:07:50

Answer 2: 1, 2016-05-25 22:43:18
