
Python - fastest way to populate a dataframe with a condition based on an index in another dataframe

I have data in an input dataframe (input_df). Based on the index of a second, benchmark dataframe (bm_df), I would like to create a third dataframe (output_df) that is populated according to a condition on the indices of the first two dataframes.

For each date in the index of bm_df, I would like to populate the output with the latest data available in input_df, subject to the condition that the data's index date is before or equal to the date in bm_df. For example, in the case-study data below, the output row for the first index date (2019-01-21) would be populated with the input_df datapoint for 2019-01-21. However, if a datapoint for 2019-01-21 did not exist, the row would use the datapoint for 2019-01-18.

The use case here is mapping and backfilling large datasets with the latest data available for a given date. I have written some Python to do this (which works), but I think there is probably a more pythonic, and therefore faster, way to implement the solution. The underlying dataset this is applied to is large in both the number of columns and the length of each column, so I would like something as efficient as possible: my current solution is too slow when run on the full dataset.

Any help is much appreciated!

input_df:

index   data
2019-01-21  0.008
2019-01-18  0.016
2019-01-17  0.006
2019-01-16  0.01
2019-01-15  0.013
2019-01-14  0.017
2019-01-11  0.017
2019-01-10  0.024
2019-01-09  0.032
2019-01-08  0.012

bm_df:

index   
2019-01-21  
2019-01-14  
2019-01-07  

output_df:

index   data
2019-01-21  0.008
2019-01-14  0.017
2019-01-07  NaN

Please see the code I am currently using below:

import pandas as pd
import numpy as np

# Import datasets
test_index = ['2019-01-21','2019-01-18','2019-01-17','2019-01-16','2019-01-15','2019-01-14','2019-01-11','2019-01-10','2019-01-09','2019-01-08']    
test_data = [0.008, 0.016,0.006,0.01,0.013,0.017,0.017,0.024,0.032,0.012]
input_df= pd.DataFrame(test_data,columns=['data'], index=test_index)

test_index_2= ['2019-01-21','2019-01-14','2019-01-07']  
bm_df= pd.DataFrame(index=test_index_2)

#Preallocate
data_mat= np.zeros([len(bm_df)])

#Loop over the bm_df index and find the most recent value in input_df dated on or before each index date
for i in range(len(bm_df)):
    #First check to see if there are no dates before the selected date, if true fill with NaN
    if sum(input_df.index <= bm_df.index[i])>0:
        data_mat[i] = input_df['data'][max(input_df.index[input_df.index <= bm_df.index[i]])]
    else:
        data_mat[i] = float('NaN')

output_df= pd.DataFrame(data_mat,columns=['data'],index=bm_df.index)

I have not tested the execution time, but I would rely on join, which the pandas documentation describes as efficient:

... Efficiently join multiple DataFrame objects by index at once ...

And I would use shift to get the value for the latest date before the searched one.

All of that gives:

output_df = bm_df.join(input_df.shift(-1), how='left')

             data
2019-01-21  0.016
2019-01-14  0.017
2019-01-07    NaN
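A minimal sketch of why the shifted join behaves this way, using the sample data from the question. It assumes input_df stays sorted newest first, so shift(-1) pulls each row's previous observation, and it adds one extra benchmark date ('2019-01-13', chosen here purely for illustration) to show that a date missing from input_df finds no match in a plain index join:

```python
import pandas as pd

dates = ['2019-01-21', '2019-01-18', '2019-01-17', '2019-01-16', '2019-01-15',
         '2019-01-14', '2019-01-11', '2019-01-10', '2019-01-09', '2019-01-08']
values = [0.008, 0.016, 0.006, 0.01, 0.013, 0.017, 0.017, 0.024, 0.032, 0.012]
input_df = pd.DataFrame({'data': values}, index=dates)  # newest first

# '2019-01-13' is an illustrative date that does not exist in input_df
bm_df = pd.DataFrame(index=['2019-01-21', '2019-01-14', '2019-01-13'])

# shift(-1) moves every value up one row, so each date now carries the value
# of the row below it, which is the previous observation in this ordering.
shifted = input_df.shift(-1)

# The join matches on exact index labels only.
strict_before = bm_df.join(shifted, how='left')
print(strict_before)
```

Here 2019-01-21 picks up the 2019-01-18 value and 2019-01-14 picks up the 2019-01-11 value, while 2019-01-13 comes back as NaN because no row in input_df carries that label.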

This approach is indeed far less versatile than explicit loops; that is the price of pandas vectorization. For example, for a less-than-or-equal condition the code is slightly different. Here is an example with an additional date in bm_df that is not present in input_df:

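...
test_index_2= ['2019-01-21','2019-01-14','2019-01-13','2019-01-07']
...
# Outer join so that dates found only in bm_df get a row, sort ascending,
# then forward-fill so each date takes the latest value on or before it
tmp_df = input_df.join(bm_df, how='outer').sort_index().ffill()
output_df = bm_df.join(tmp_df, how='inner')

And we obtain, as expected:

             data
2019-01-21  0.008
2019-01-14  0.017
2019-01-13  0.017
2019-01-07    NaN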
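Two other pandas routes to a "latest value on or before each date" lookup are reindex with method='ffill' and pd.merge_asof with direction='backward'. Neither is part of the approach above; this is only a sketch, and it assumes the string dates are converted to datetimes and the input is sorted ascending, which both calls require. The names sorted_input, via_reindex and via_asof are illustrative:

```python
import pandas as pd

dates = ['2019-01-21', '2019-01-18', '2019-01-17', '2019-01-16', '2019-01-15',
         '2019-01-14', '2019-01-11', '2019-01-10', '2019-01-09', '2019-01-08']
values = [0.008, 0.016, 0.006, 0.01, 0.013, 0.017, 0.017, 0.024, 0.032, 0.012]
input_df = pd.DataFrame({'data': values}, index=pd.to_datetime(dates))
bm_index = pd.to_datetime(['2019-01-21', '2019-01-14', '2019-01-13', '2019-01-07'])

# Option 1: reindex with a forward fill. The source index must be monotonic,
# so sort ascending first; each benchmark date then takes the last row at or
# before it, and NaN if no such row exists.
sorted_input = input_df.sort_index()
via_reindex = sorted_input.reindex(bm_index, method='ffill')

# Option 2: merge_asof. Both key columns must be sorted ascending;
# direction='backward' selects the last input row with date <= benchmark date.
left = pd.DataFrame({'date': bm_index.sort_values()})
right = sorted_input.rename_axis('date').reset_index()
via_asof = pd.merge_asof(left, right, on='date', direction='backward')
via_asof = via_asof.set_index('date').loc[bm_index]  # restore benchmark order

print(via_reindex)
print(via_asof)
```

Both variants fill 2019-01-13 from the 2019-01-11 row and leave 2019-01-07 as NaN, since input_df has no observation on or before that date. Both handle every column of input_df in one call, but whether either is faster than the join-based version on a wide dataset would need to be timed.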


