Modifying the date index of a pandas DataFrame

Question

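I am trying to write an efficient function that takes an average-sized DataFrame (~5000 rows) and returns a DataFrame with the same index and a column holding the latest year such that, for each date in the original index, the month containing that date lies between a pre-specified start date (st_d) and end date (end_d). My code decrements the year until the month for a given index date falls within the desired range. However, it is really slow: for a DataFrame with only 366 entries it takes about 0.2 seconds. I need to make it at least an order of magnitude faster so that I can apply it repeatedly to tens of thousands of DataFrames. I would greatly appreciate any suggestions.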

import pandas as pd
import numpy as np
import time
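from pandas.tseries.offsets import MonthEnd

class BadDataException(Exception):
    """Raised when no year in the look-back window fully contains the month."""
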
def year_replace(st_d, end_d, x):

    tmp = time.perf_counter()

    def prior_year(d):
        # Look back at most 100 years, which is more than enough.
        for i_t in range(100):

            # The month must be fully covered by one of the data years.
            t_start = pd.to_datetime(str(d.month) + '/' + str(end_d.year - i_t), format="%m/%Y")
            t_end = t_start + MonthEnd(1)
            if t_start <= end_d and t_start >= st_d and t_end <= end_d and t_end >= st_d:
                # Return as soon as a year qualifies, including the last one tried.
                return t_start.year
        raise BadDataException("Not enough data for Gradient Boosted tree.")

    output = pd.Series(index = x.index, data = x.index.map(lambda tt: prior_year(tt)), name = 'year')

    print("time for single dataframe replacement = ", time.perf_counter() - tmp)    

    return output


i = pd.date_range('01-01-2019', '01-01-2020')
x = pd.DataFrame(index = i, data=np.full(len(i), 0))

st_d = pd.to_datetime('01/2016', format="%m/%Y")
end_d = pd.to_datetime('01/2018', format="%m/%Y")
year_replace(st_d, end_d, x)

Answer 1 (accepted, 2020-01-02 01:27:58)

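My advice would be: avoid loops whenever you can, and check whether a simpler approach is available.

If I understand correctly, your goal is:

For given start and stop timestamps, find the latest timestamp t whose month is taken from the index and for which start <= t <= stop.

I believe this can be formalized as follows (I kept your function signature for convenience):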

def f(start, stop, x):
    assert start < stop
    tmp = time.perf_counter()
    def y(d):
        # Check current year:
        if start <= d.replace(day=1, year=stop.year) <= stop:
            return stop.year
        # Check previous year:
        if start <= d.replace(day=1, year=stop.year-1) <= stop:
            return stop.year-1
        # Otherwise fail:
        raise TypeError("Ooops")
    # Apply to index:
    df = pd.Series(index=x.index, data=x.index.map(lambda t: y(t)), name='year')
    print("Tick: ", time.perf_counter() - tmp) 
    return df

It appears to run faster, as requested (by almost two orders of magnitude, although we should benchmark to be sure, e.g. with timeit):

Tick:  0.004744200000004639
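A single perf_counter reading is noisy, so a benchmark along the following lines may give a more reliable figure. This is only a sketch: it reuses the inputs from the question, and the names y, run_once, and best_per_call are illustrative choices, not part of the original code. It also raises ValueError rather than TypeError, which is an assumption about what fits better here.

```python
import timeit

import pandas as pd

# Same inputs as in the question.
index = pd.date_range("2019-01-01", "2020-01-01")
start = pd.Timestamp("2016-01-01")
stop = pd.Timestamp("2018-01-01")


def y(d):
    # Check the stop year first, then the year before it.
    for year in (stop.year, stop.year - 1):
        if start <= d.replace(day=1, year=year) <= stop:
            return year
    raise ValueError("No year places this month inside [start, stop].")


def run_once():
    # Apply the per-date check to the whole index.
    return index.map(y)


# Each of the 5 timings covers 20 calls; the minimum is the least noisy estimate.
timings = timeit.repeat(run_once, repeat=5, number=20)
best_per_call = min(timings) / 20
print(f"best of 5: {best_per_call * 1e3:.3f} ms per call")
```

The absolute numbers depend on the machine and the pandas version, so only the ratio between two implementations measured the same way is meaningful.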

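No iteration is needed: you only have to check the stop year and the year before it. If both checks fail, no timestamp satisfying your requirement can exist.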
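Because the result depends only on the month of each index entry, the per-row function call can be dropped as well. The sketch below is one possible way to do that, not the method from the answer: it builds a 12-entry lookup table and indexes into it with NumPy. It assumes the question's stricter criterion (the whole month must lie within [st_d, end_d]), and the name year_replace_lookup is hypothetical.

```python
import numpy as np
import pandas as pd
from pandas.tseries.offsets import MonthEnd


def year_replace_lookup(st_d, end_d, x):
    """Hypothetical month-lookup variant of year_replace (illustration only)."""
    # table[m] holds the latest year whose month m lies fully in [st_d, end_d];
    # -1 marks months for which no such year exists. Slot 0 is unused.
    table = np.full(13, -1, dtype=np.int64)
    for month in range(1, 13):
        # The latest qualifying year, if any, is end_d.year or the year before.
        for year in (end_d.year, end_d.year - 1):
            t_start = pd.Timestamp(year=year, month=month, day=1)
            t_end = t_start + MonthEnd(1)
            if st_d <= t_start and t_end <= end_d:
                table[month] = year
                break
    # One vectorized lookup replaces the per-row Python call.
    years = table[x.index.month.to_numpy()]
    if (years < 0).any():
        raise ValueError("Some months never fall inside [st_d, end_d].")
    return pd.Series(years, index=x.index, name="year")


# Same inputs as in the question.
i = pd.date_range("2019-01-01", "2020-01-01")
x = pd.DataFrame(index=i, data=np.full(len(i), 0))
st_d = pd.Timestamp("2016-01-01")
end_d = pd.Timestamp("2018-01-01")
result = year_replace_lookup(st_d, end_d, x)
```

The cost of building the table is fixed at 24 checks, so the work per DataFrame is dominated by a single array lookup regardless of the number of rows.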

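If the day must be preserved, simply remove day=1 from the replace call. If you need strict inequalities at the boundaries, modify the comparisons accordingly. The following function: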

def y(d):
    if start < d.replace(year=stop.year) < stop:
        return stop.year
    if start < d.replace(year=stop.year-1) < stop:
        return stop.year-1
    raise TypeError("Ooops")

returns the same result as your function.
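One caveat when day=1 is dropped: 29 February cannot be moved into a non-leap year, and datetime-style replace is expected to raise ValueError in that case. The helper below is a sketch of one way around this. The name replace_year and the choice to fall back to 28 February are assumptions made for illustration, not something the answer specifies.

```python
import calendar

import pandas as pd


def replace_year(d, year):
    """Hypothetical helper: move d to another year, clamping 29 Feb to 28 Feb."""
    if d.month == 2 and d.day == 29 and not calendar.isleap(year):
        # Assumed policy: use the last valid day of February instead.
        return d.replace(year=year, day=28)
    return d.replace(year=year)


leap_day = pd.Timestamp("2020-02-29")
moved_to_common_year = replace_year(leap_day, 2017)
moved_to_leap_year = replace_year(leap_day, 2016)
```

Calling replace_year(d, stop.year) in place of d.replace(year=stop.year) inside y would keep the function usable on indexes that contain a leap day.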
