Python - fastest way to populate a dataframe with a condition based on an index in another dataframe
I have data in an input dataframe (input_df). Based on an index in another benchmark dataframe (bm_df), I would like to create a third dataframe (output_df) that is populated based on a condition using the indices in the original two dataframes.
For each date in the index of bm_df, I would like to populate my output using the latest data available in input_df, subject to the condition that the data has an index date before or equal to that in bm_df. For example, in the case study data below, the output dataframe for the first index date (2019-01-21) would be populated with the data from the input_df datapoint for 2019-01-21. However, if a datapoint for 2019-01-21 did not exist, this would use 2019-01-18.
The use case here is mapping and backfilling large datasets with the latest data available for a given date. I have written some Python to do this (which works), however I think there is probably a more pythonic, and therefore faster, way to implement the solution. The underlying dataset this is applied to has large dimensions in terms of the number and length of columns, so I would like something as efficient as possible - my current solution is too slow when run on the full dataset I am using.
Any help is much appreciated!
input_df:
index data
2019-01-21 0.008
2019-01-18 0.016
2019-01-17 0.006
2019-01-16 0.01
2019-01-15 0.013
2019-01-14 0.017
2019-01-11 0.017
2019-01-10 0.024
2019-01-09 0.032
2019-01-08 0.012
bm_df:
index
2019-01-21
2019-01-14
2019-01-07
output_df:
index data
2019-01-21 0.008
2019-01-14 0.017
2019-01-07 NaN
Please see the code I am currently using below:
import pandas as pd
import numpy as np

# Import datasets
test_index = ['2019-01-21','2019-01-18','2019-01-17','2019-01-16','2019-01-15','2019-01-14','2019-01-11','2019-01-10','2019-01-09','2019-01-08']
test_data = [0.008, 0.016, 0.006, 0.01, 0.013, 0.017, 0.017, 0.024, 0.032, 0.012]
input_df = pd.DataFrame(test_data, columns=['data'], index=test_index)
test_index_2 = ['2019-01-21','2019-01-14','2019-01-07']
bm_df = pd.DataFrame(index=test_index_2)

# Preallocate
data_mat = np.zeros(len(bm_df))

# Loop over the bm_df index and find the most recent value in input_df
# dated on or before each index date
for i in range(len(bm_df)):
    # First check whether any dates fall on or before the selected date;
    # if none do, fill with NaN
    if sum(input_df.index <= bm_df.index[i]) > 0:
        data_mat[i] = input_df['data'][max(input_df.index[input_df.index <= bm_df.index[i]])]
    else:
        data_mat[i] = float('NaN')

output_df = pd.DataFrame(data_mat, columns=['data'], index=bm_df.index)
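For reference, this "latest value on or before each date" lookup is exactly what pandas' built-in as-of join does. A minimal sketch of the same lookup using `pd.merge_asof` (not part of the original post; it assumes the string dates are parsed to datetimes, since `merge_asof` requires both sides sorted ascending on the key):

```python
import pandas as pd

test_index = ['2019-01-21', '2019-01-18', '2019-01-17', '2019-01-16', '2019-01-15',
              '2019-01-14', '2019-01-11', '2019-01-10', '2019-01-09', '2019-01-08']
test_data = [0.008, 0.016, 0.006, 0.01, 0.013, 0.017, 0.017, 0.024, 0.032, 0.012]
input_df = pd.DataFrame({'data': test_data}, index=pd.to_datetime(test_index))
bm_df = pd.DataFrame(index=pd.to_datetime(['2019-01-21', '2019-01-14', '2019-01-07']))

# merge_asof needs both frames sorted ascending on the join key
left = bm_df.sort_index().reset_index().rename(columns={'index': 'date'})
right = input_df.sort_index().reset_index().rename(columns={'index': 'date'})

# direction='backward' picks the last input row with date <= each bm date;
# dates with no earlier input row get NaN
merged = pd.merge_asof(left, right, on='date', direction='backward')

# Restore the original bm_df ordering
output_df = merged.set_index('date').reindex(bm_df.index)
print(output_df)
```

On the example data this reproduces output_df above, including the NaN for 2019-01-07, without an explicit Python loop.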
I have not tested the execution time, but I would rely on join, which the pandas documentation describes as efficient:

... Efficiently join multiple DataFrame objects by index at once...
And I would use shift to get the value for the latest date before the searched one. All that gives:
output_df = bm_df.join(input_df.shift(-1), how='left')
data
2019-01-21 0.016
2019-01-14 0.017
2019-01-07 NaN
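As a quick illustration of what `shift(-1)` does here (a toy sketch, not from the original answer): because the input dates are sorted descending, shifting by -1 moves every value up one row, so each date ends up holding the value of the next-earlier date:

```python
import pandas as pd

# Toy frame with dates sorted descending, as in the question
df = pd.DataFrame({'data': [0.008, 0.016, 0.006]},
                  index=['2019-01-21', '2019-01-18', '2019-01-17'])

# shift(-1) moves each value up one row: each date now holds the
# value of the next-earlier date, and the last row becomes NaN
shifted = df.shift(-1)
print(shifted)
```

That is why 2019-01-21 receives 0.016 (the 2019-01-18 value) in the output above, i.e. a strictly-before match rather than on-or-before.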
This approach is indeed far less versatile than explicit loops; that is the price of pandas vectorization. For example, for a less than or equal to condition the code will be slightly different. Here is an example with an additional date in bm_df that is not present in input_df:
...
test_index_2= ['2019-01-21','2019-01-14','2019-01-13','2019-01-07']
...
tmp_df = input_df.join(bm_df, how='outer', sort=True).fillna(method='bfill')
output_df = bm_df.join(tmp_df, how='inner')
And we obtain, as expected:
data
2019-01-21 0.008
2019-01-14 0.017
2019-01-13 0.017
2019-01-07 0.012
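A runnable sketch of the two-step join on the example data (my own reconstruction, assuming an outer join so that bm-only dates appear in the intermediate frame; printing that frame makes the mechanics visible - the outer join yields the sorted union of both indexes, and the backward fill then pulls each bm-only date's value from the next later date):

```python
import pandas as pd

input_df = pd.DataFrame(
    {'data': [0.008, 0.016, 0.006, 0.01, 0.013, 0.017, 0.017, 0.024, 0.032, 0.012]},
    index=['2019-01-21', '2019-01-18', '2019-01-17', '2019-01-16', '2019-01-15',
           '2019-01-14', '2019-01-11', '2019-01-10', '2019-01-09', '2019-01-08'])
bm_df = pd.DataFrame(index=['2019-01-21', '2019-01-14', '2019-01-13', '2019-01-07'])

# Outer join: sorted union of both indexes; bm-only dates hold NaN
tmp_df = input_df.join(bm_df, how='outer', sort=True)
print(tmp_df)

# Backward fill: each NaN takes the value of the next (later) date
tmp_df = tmp_df.bfill()

# Inner join: keep only the bm_df dates
output_df = bm_df.join(tmp_df, how='inner')
print(output_df)
```

Note that 2019-01-07 is filled from 2019-01-08 (a later date), so this variant matches each date against the nearest available neighbour rather than the question's on-or-before condition.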