Fastest way to create a DataFrame from the last available data

Question

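I had trouble finding an answer to this on the forum because it is hard to put into keywords. Any keyword suggestions would be appreciated, so that others can find and benefit from this question.

The closest question I found did not really answer mine.

My problem is as follows:

I have a DataFrame called ref and a list of dates called pub. ref is indexed by dates, but those dates differ from the dates in pub (some values will match). I want to create a new DataFrame that contains all the dates in pub, filled with the "last available data" from ref.

So, say ref is: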

Dat          col1 col2 
2015-01-01   5    4
2015-01-02   6    7
2015-01-05   8    9

and pub is:

2015-01-01
2015-01-04
2015-01-06

I want to create a DataFrame like this:

Dat          col1 col2 
2015-01-01   5    4
2015-01-04   6    7
2015-01-06   8    9

Performance is a concern, so I am looking for the fastest way to do this.

Thanks in advance.

Answer 1

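You can do an outer merge, set Dat as the new index, sort it, forward-fill, and then reindex on the dates in pub.

import datetime as dt
import pandas as pd

# Here ref is assumed to hold 'Dat' as a regular column of datetime.date values.
dates = ['2015-01-01', '2015-01-04', '2015-01-06']
pub = pd.DataFrame([dt.datetime.strptime(ts, '%Y-%m-%d').date() for ts in dates],
                   columns=['Dat'])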

>>> (ref
     .merge(pub, on='Dat', how='outer')
     .set_index('Dat')
     .sort_index()
     .ffill()
     .reindex(pub.Dat))
            col1  col2
Dat                   
2015-01-01     5     4
2015-01-04     6     7
2015-01-06     8     9
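
For anyone who wants to run this end to end, here is a minimal self-contained sketch of the same pipeline. It assumes ref keeps Dat as a regular column and that both sides are parsed with pd.to_datetime (the original answer used datetime.date objects instead). The variable names merged and direct are only illustrative. The second result swaps in a different technique, reindex with method="ffill", which is not part of the answer above but appears to give the same rows for this example when the index of ref is sorted.

```python
import pandas as pd

# Example data from the question; 'Dat' as a regular column is an assumption.
ref = pd.DataFrame({
    "Dat": pd.to_datetime(["2015-01-01", "2015-01-02", "2015-01-05"]),
    "col1": [5, 6, 8],
    "col2": [4, 7, 9],
})
pub = pd.DataFrame(
    {"Dat": pd.to_datetime(["2015-01-01", "2015-01-04", "2015-01-06"])}
)

# Pipeline from the answer: outer merge, sort by date, forward-fill,
# then keep only the rows whose date appears in pub.
merged = (ref
          .merge(pub, on="Dat", how="outer")
          .set_index("Dat")
          .sort_index()
          .ffill()
          .reindex(pub["Dat"]))

# Swapped-in alternative (not from the answer): let reindex do the
# forward fill directly. This requires a sorted (monotonic) index on ref.
direct = ref.set_index("Dat").reindex(pub["Dat"], method="ffill")

print(merged)
print(direct)
```

Note that the merge route temporarily introduces NaN rows for the dates that exist only in pub, so integer columns may come back as floats; the reindex route avoids the intermediate merge and sort.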

Answer 2

Use np.searchsorted to find the insertion index just after each date (the 'right' option, which is needed to handle equal dates correctly):

In [27]: pub = ['2015-01-01', '2015-01-04', '2015-01-06']

In [28]: df
Out[28]: 
            col1  col2
Dat                   
2015-01-01     5     4
2015-01-02     6     7
2015-01-05     8     9

In [29]: y=np.searchsorted(list(df.index),pub,'right')
#array([1, 2, 3], dtype=int64)

Then rebuild the DataFrame:

In [30]: pd.DataFrame(df.iloc[y-1].values,index=pub)
Out[30]: 
            0  1
2015-01-01  5  4
2015-01-04  6  7
2015-01-06  8  9
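
Two details of the searchsorted approach are worth spelling out: the rebuilt frame above loses the column names (they become 0 and 1), and a pub date earlier than the first ref date gives y - 1 == -1, which iloc silently maps to the last row. The sketch below is one possible way to handle both, assuming ref has a sorted DatetimeIndex; the names pos and result are illustrative, and raising an error for too-early dates is just one choice among several.

```python
import numpy as np
import pandas as pd

# Example data from the question, with 'Dat' as a sorted DatetimeIndex (assumption).
ref = pd.DataFrame(
    {"col1": [5, 6, 8], "col2": [4, 7, 9]},
    index=pd.DatetimeIndex(["2015-01-01", "2015-01-02", "2015-01-05"], name="Dat"),
)
pub = pd.DatetimeIndex(["2015-01-01", "2015-01-04", "2015-01-06"], name="Dat")

# side="right" returns the position just past any equal date, so subtracting 1
# gives the last ref row dated at or before each pub date.
pos = np.searchsorted(ref.index.to_numpy(), pub.to_numpy(), side="right") - 1

# pos == -1 means the pub date precedes all ref data; without this check,
# positional indexing with -1 would silently pick the last row.
if (pos < 0).any():
    raise ValueError("some pub dates are earlier than the first date in ref")

# Rebuild with the pub dates as index while keeping the original column names.
result = pd.DataFrame(ref.to_numpy()[pos], index=pub, columns=ref.columns)
print(result)
```

Since performance was the concern in the question, it is worth timing this against the merge-based answer on realistically sized data before picking one.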

Answer 1
2 Accepted 2016-04-18 20:14:20

Answer 2
2 2016-04-18 20:36:47
