
How to flag a column of a dataframe by values of another dataframe in the most efficient way in python/pandas?

I've got a dataframe "A" (~500k records). It contains two columns: "from_timestamp" and "to_timestamp".

I've got a dataframe "B" (~5M records). It has some values and a column named "actual_timestamp".

I want to flag every row in dataframe "B" whose "actual_timestamp" value lies between the values of any "from_timestamp" and "to_timestamp" pair (bounds included).

I want something like this, but much more efficient:

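B['corrupted_flag'] = False  # start with every row unflagged
for index, row in A.iterrows():
    # rows of B whose timestamp falls inside this interval (bounds included)
    cond1 = B['actual_timestamp'] >= row['from_timestamp']
    cond2 = B['actual_timestamp'] <= row['to_timestamp']
    # .loc replaces .ix, which was removed in pandas 1.0
    B.loc[cond1 & cond2, 'corrupted_flag'] = True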

What is the fastest/most efficient way to do this in python/pandas?

Update: Sample data

dataframe A (input):

from_timestamp    to_timestamp
3                 4             
6                 9
8                 10

dataframe B (input):

data    actual_timestamp
a       2
b       3
c       4
d       5
e       8
f       10
g       11
h       12

dataframe B (expected output):

data    actual_timestamp   corrupted_flag
a       2                  False
b       3                  True
c       4                  True
d       5                  False
e       8                  True
f       10                 True
g       11                 False
h       12                 False
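
For anyone who wants to reproduce this, the sample frames above can be built as follows. The flag is computed here by comparing every timestamp against every interval at once with NumPy broadcasting. This is only a small reference check under the assumption that both frames are tiny: the intermediate boolean matrix has len(B) x len(A) entries, which would be about 2.5 x 10^12 for the sizes mentioned in the question.

```python
import pandas as pd

# Sample data copied from the tables above.
A = pd.DataFrame({'from_timestamp': [3, 6, 8], 'to_timestamp': [4, 9, 10]})
B = pd.DataFrame({'data': list('abcdefgh'),
                  'actual_timestamp': [2, 3, 4, 5, 8, 10, 11, 12]})
expected = [False, True, True, False, True, True, False, False]

# Compare every timestamp in B with every interval in A at once.
ts = B['actual_timestamp'].to_numpy()[:, None]   # shape (len(B), 1)
starts = A['from_timestamp'].to_numpy()          # shape (len(A),)
ends = A['to_timestamp'].to_numpy()              # shape (len(A),)

# A row is flagged if it lies inside at least one closed interval.
B['corrupted_flag'] = ((ts >= starts) & (ts <= ends)).any(axis=1)
```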

You can use the intervaltree package to build an interval tree from the timestamps in DataFrame A, and then check if each timestamp from DataFrame B is in the tree:

from intervaltree import IntervalTree

tree = IntervalTree.from_tuples(zip(A['from_timestamp'], A['to_timestamp'] + 0.1))
B['corrupted_flag'] = B['actual_timestamp'].map(lambda x: tree.overlaps(x))

Note that you need to pad A['to_timestamp'] slightly, as the upper bound of an interval is not included as part of the interval in the intervaltree package (although the lower bound is).
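
To see what the padding changes, here is a minimal sketch using the intervals and timestamps from the question's sample data. The variable names (pairs, unpadded, padded) are only for illustration, and the 0.1 pad assumes integer timestamps as in the sample.

```python
from intervaltree import IntervalTree

# (from_timestamp, to_timestamp) pairs from the question's dataframe A.
pairs = [(3, 4), (6, 9), (8, 10)]

# Without padding, each interval is half-open: begin is included, end is not.
unpadded = IntervalTree.from_tuples(pairs)

# With padding, the original upper bound falls inside the interval.
padded = IntervalTree.from_tuples((begin, end + 0.1) for begin, end in pairs)

# actual_timestamp values from the question's dataframe B.
timestamps = [2, 3, 4, 5, 8, 10, 11, 12]
unpadded_flags = [unpadded.overlaps(t) for t in timestamps]
padded_flags = [padded.overlaps(t) for t in timestamps]
```

With the unpadded tree, the timestamps 4 and 10 are not flagged because they sit exactly on an upper bound; the padded tree reproduces the expected output from the question.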

This method improved performance by a little more than a factor of 14 on some sample data I generated (A = 10k rows, B = 100k rows). The performance boost got bigger the more rows I added.

I've used the intervaltree package with datetime objects before, so the code above should still work if your timestamps aren't integers like they are in your sample data; you just might need to change how upper bounds are padded.

Based on the ideas above, my final solution is the following (it does not raise a MemoryError on big datasets):

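from intervaltree import IntervalTree
import pandas as pd

def flagDataWithGaps(A, B):
    # Work on float copies of the timestamp columns.
    A['from_ts'] = A['from'].astype(float)
    A['to_ts'] = A['to'].astype(float)
    # Pad the upper bound, since intervaltree excludes it from the interval.
    A['to_ts'] = A['to_ts'] + 0.1
    B['actual_ts'] = B['actual'].astype(float)

    tree = IntervalTree.from_tuples(zip(A['from_ts'], A['to_ts']))
    # Lazily test each timestamp of B against the tree.
    col = (tree.overlaps(x) for x in B['actual_ts'])

    # Build the flag column on B's own index so the rows stay aligned
    # even when B does not have a default 0..n-1 index.
    B['is_gap'] = pd.Series(col, index=B.index)
    return B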
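
If adding the intervaltree dependency is not an option, a different technique (not the one used in the answers above) is to sort the intervals by start, keep a running maximum of their ends, and look up each timestamp with a binary search. The sketch below assumes the column names from the question's sample data; flag_in_any_interval is a hypothetical helper name. It treats both bounds as included, so no padding is needed.

```python
import numpy as np
import pandas as pd

def flag_in_any_interval(A, B):
    """Return a boolean array: True where B's timestamp lies in any closed interval of A."""
    if len(A) == 0:
        return np.zeros(len(B), dtype=bool)

    # Sort the intervals by their start.
    order = np.argsort(A['from_timestamp'].to_numpy(), kind='stable')
    starts = A['from_timestamp'].to_numpy()[order]
    # Running maximum of the interval ends, in order of increasing start.
    max_ends = np.maximum.accumulate(A['to_timestamp'].to_numpy()[order])

    ts = B['actual_timestamp'].to_numpy()
    # Index of the last interval whose start is <= each timestamp (-1 if none).
    idx = np.searchsorted(starts, ts, side='right') - 1
    has_start = idx >= 0
    # Covered if some interval starting at or before the timestamp ends at or after it.
    return has_start & (max_ends[np.clip(idx, 0, None)] >= ts)

# Sample data from the question.
A = pd.DataFrame({'from_timestamp': [3, 6, 8], 'to_timestamp': [4, 9, 10]})
B = pd.DataFrame({'data': list('abcdefgh'),
                  'actual_timestamp': [2, 3, 4, 5, 8, 10, 11, 12]})
B['corrupted_flag'] = flag_in_any_interval(A, B)
```

The work is one sort of A plus one binary search per row of B, and memory use stays proportional to len(A) + len(B). I have not benchmarked it against the intervaltree version.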


