What is the most efficient way to flag rows of one DataFrame based on the values of another DataFrame in Python / pandas?

Question

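I have a DataFrame "A" (about 500k records). It contains two columns: "fromTimestamp" and "toTimestamp".

I have a DataFrame "B" (~5M records). It has some values and a column named "actualTimestamp".

I want to flag every row in DataFrame "B" whose "actualTimestamp" value lies between the values of any "fromTimestamp" and "toTimestamp" pair.

I want something like the following, but with more efficient code: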

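# Start with every row unflagged so the column holds booleans, not NaN.
B['corrupted_flag'] = False
for index, row in A.iterrows():
    cond1 = B['actual_timestamp'] >= row['from_timestamp']
    cond2 = B['actual_timestamp'] <= row['to_timestamp']
    # .loc replaces the deprecated .ix indexer
    B.loc[cond1 & cond2, 'corrupted_flag'] = True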

What is the fastest / most efficient way to do this in Python / pandas?

Update: sample data

DataFrame A (input):

from_timestamp    to_timestamp
3                 4             
6                 9
8                 10

DataFrame B (input):

data    actual_timestamp
a       2
b       3
c       4
d       5
e       8
f       10
g       11
h       12

DataFrame B (expected output):

data    actual_timestamp   corrupted_flag
a       2                  False
b       3                  True
c       4                  True
d       5                  False
e       8                  True
f       10                 True
g       11                 False
h       12                 False

Answer 1

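You can use the intervaltree package to build an interval tree from the timestamps in DataFrame A, and then check whether each timestamp in DataFrame B falls inside the tree: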

from intervaltree import IntervalTree

tree = IntervalTree.from_tuples(zip(A['from_timestamp'], A['to_timestamp'] + 0.1))
B['corrupted_flag'] = B['actual_timestamp'].map(lambda x: tree.overlaps(x))

Note that you need to pad A['to_timestamp'] slightly, because in the intervaltree package the upper bound of an interval is not included as part of the interval (although the lower bound is).
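To make the boundary behaviour concrete, here is a minimal sketch using the single interval (3, 4) from the sample data. The variable names are only illustrative, and the 0.1 padding assumes integer timestamps as in the question.

```python
from intervaltree import IntervalTree

# One interval from the sample data, without and with the padding.
unpadded = IntervalTree.from_tuples([(3, 4)])
padded = IntervalTree.from_tuples([(3, 4 + 0.1)])

lower_hit = unpadded.overlaps(3)       # lower bound counts as inside
upper_hit = unpadded.overlaps(4)       # upper bound is excluded
upper_hit_padded = padded.overlaps(4)  # padding brings 4 back inside

print(lower_hit, upper_hit, upper_hit_padded)
```

Without the padding, row "c" (actual_timestamp = 4) in the sample data would not be flagged.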

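On some sample data I generated (A = 10k rows, B = 100k rows), this approach improved performance by more than 14x. The more rows I added, the larger the performance gain.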

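I have used the intervaltree package with datetime objects before, so the code above should still work if your timestamps are not integers as in the sample data; you may just need to change how the upper bound is padded.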
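For reference, here is the same approach run end to end on the sample data from the question. This is only a sketch: the DataFrames are rebuilt by hand from the tables above, and the 0.1 padding again assumes integer timestamps.

```python
import pandas as pd
from intervaltree import IntervalTree

# Sample data copied from the question.
A = pd.DataFrame({"from_timestamp": [3, 6, 8], "to_timestamp": [4, 9, 10]})
B = pd.DataFrame({"data": list("abcdefgh"),
                  "actual_timestamp": [2, 3, 4, 5, 8, 10, 11, 12]})

# Pad the upper bound so that it is treated as inclusive.
tree = IntervalTree.from_tuples(zip(A["from_timestamp"], A["to_timestamp"] + 0.1))

# One tree lookup per row of B instead of one full scan of B per row of A.
B["corrupted_flag"] = B["actual_timestamp"].map(lambda x: tree.overlaps(x))
print(B)
```

The resulting corrupted_flag column can be compared directly against the expected output in the question.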

Answer 2

Based on the idea above, my final solution is as follows (it does not raise a MemoryError on large datasets):

from intervaltree import IntervalTree

def flagDataWithGaps(A, B):
    # Convert the bounds to float and pad the upper bound, because
    # intervaltree treats the upper bound as exclusive.
    A['from_ts'] = A['from'].astype(float)
    A['to_ts'] = A['to'].astype(float) + 0.1
    B['actual_ts'] = B['actual'].astype(float)

    tree = IntervalTree.from_tuples(zip(A['from_ts'], A['to_ts']))

    # Assign the flags as a plain list so they are matched by position,
    # whatever index B happens to have.
    B['is_gap'] = [tree.overlaps(x) for x in B['actual_ts']]
    return B
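As a side note, the same flag can be computed without intervaltree. The sketch below is not taken from either answer; it swaps in a different technique: sort the intervals by start, keep a running maximum of the end bounds, and binary-search each timestamp with NumPy. The function name flag_in_intervals is made up for illustration, and the code assumes numeric timestamps with no missing values. Both bounds are treated as inclusive, so no padding is needed.

```python
import numpy as np
import pandas as pd

def flag_in_intervals(starts, ends, points):
    """Return a boolean array: True where a point lies in any closed [start, end]."""
    starts = np.asarray(starts)
    ends = np.asarray(ends)
    points = np.asarray(points)
    if len(starts) == 0:
        return np.zeros(len(points), dtype=bool)

    order = np.argsort(starts, kind="stable")
    starts = starts[order]
    # Furthest end bound among all intervals starting at or before each sorted start.
    reach = np.maximum.accumulate(ends[order])

    # Position of the last interval whose start is <= the point (-1 if there is none).
    idx = np.searchsorted(starts, points, side="right") - 1
    return (idx >= 0) & (points <= reach[np.clip(idx, 0, None)])

# Sample data copied from the question.
A = pd.DataFrame({"from_timestamp": [3, 6, 8], "to_timestamp": [4, 9, 10]})
B = pd.DataFrame({"data": list("abcdefgh"),
                  "actual_timestamp": [2, 3, 4, 5, 8, 10, 11, 12]})

B["corrupted_flag"] = flag_in_intervals(
    A["from_timestamp"], A["to_timestamp"], B["actual_timestamp"]
)
print(B)
```

A point is covered exactly when some interval starts at or before it and the furthest end among those intervals is at or after it, which is what the running maximum checks. I have not benchmarked this against the intervaltree versions.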


Answer 1: score 3, accepted, 2016-05-17 23:07:50

Answer 2: score 1, 2016-05-25 22:43:18
