Modifying the date index of pandas dataframe
I am trying to write a highly efficient function that takes an average-size dataframe (~5000 rows) and returns a dataframe with the same index and a column holding, for each date in the original index, the latest year such that the month containing that date lies between a pre-specified start date (st_d) and end date (end_d). I wrote code that decrements the year until the month for a particular index date falls within the desired range. However, it is really slow: for a dataframe with only 366 entries it takes ~0.2 s. I need to make it at least an order of magnitude faster so that I can apply it repeatedly to tens of thousands of dataframes. I would very much appreciate any suggestions.

import pandas as pd
import numpy as np
import time
from pandas.tseries.offsets import MonthEnd


# Defined here only so that the snippet runs on its own.
class BadDataException(Exception):
    pass


def year_replace(st_d, end_d, x):
    tmp = time.perf_counter()

    def prior_year(d):
        # 100 is number of the years back, more than enough.
        for i_t in range(100):
            # The month should have been fully seen in one of the data years.
            t_start = pd.to_datetime(str(d.month) + '/' + str(end_d.year - i_t), format="%m/%Y")
            t_end = t_start + MonthEnd(1)
            if t_start <= end_d and t_start >= st_d and t_end <= end_d and t_end >= st_d:
                break
        if i_t < 99:
            return t_start.year
        else:
            raise BadDataException("Not enough data for Gradient Boosted tree.")

    output = pd.Series(index=x.index, data=x.index.map(lambda tt: prior_year(tt)), name='year')
    print("time for single dataframe replacement = ", time.perf_counter() - tmp)
    return output


i = pd.date_range('01-01-2019', '01-01-2020')
x = pd.DataFrame(index=i, data=np.full(len(i), 0))
st_d = pd.to_datetime('01/2016', format="%m/%Y")
end_d = pd.to_datetime('01/2018', format="%m/%Y")
year_replace(st_d, end_d, x)

My advice is: avoid loops whenever you can, and check whether a simpler approach is available.

If I understand correctly, what you aim to do is: for given start and stop timestamps, find the latest timestamp t whose month is taken from the index and which satisfies start <= t <= stop.

I believe this can be formalized as follows (I kept your function signature for convenience):

def f(start, stop, x):
    assert start < stop
    tmp = time.perf_counter()

    def y(d):
        # Check current year:
        if start <= d.replace(day=1, year=stop.year) <= stop:
            return stop.year
        # Check previous year:
        if start <= d.replace(day=1, year=stop.year-1) <= stop:
            return stop.year-1
        # Otherwise fail:
        raise TypeError("Ooops")

    # Apply to index:
    df = pd.Series(index=x.index, data=x.index.map(lambda t: y(t)), name='year')
    print("Tick: ", time.perf_counter() - tmp)
    return df

It seems to execute faster, as requested (almost two orders of magnitude; we should benchmark to be sure, e.g. with timeit):
Tick: 0.004744200000004639
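
To make that comparison less dependent on a single run, the check can be timed with timeit. The following is only a sketch: latest_year is a hypothetical name for the same two-candidate check as f, without the print, and the repeat counts are arbitrary choices.

```python
import timeit

import pandas as pd


def latest_year(start, stop, index):
    """Two-candidate check (hypothetical helper, same logic as f)."""
    def y(d):
        # Try the year of `stop` first, then the year before it.
        for year in (stop.year, stop.year - 1):
            if start <= d.replace(day=1, year=year) <= stop:
                return year
        raise ValueError(f"no year puts month {d.month} inside the range")

    return pd.Series([y(d) for d in index], index=index, name="year")


index = pd.date_range("01-01-2019", "01-01-2020")
start = pd.Timestamp("2016-01-01")
stop = pd.Timestamp("2018-01-01")

result = latest_year(start, stop, index)

# Run the call 20 times per measurement, repeat 5 times, keep the best.
number = 20
timings = timeit.repeat(lambda: latest_year(start, stop, index),
                        number=number, repeat=5)
per_call = min(timings) / number
print(f"best time per call: {per_call:.6f} s")
```

Taking the minimum over several repeats reduces the influence of other processes on the machine; the absolute numbers will still differ between systems.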
There is no need to iterate: you only need to check the current and the previous year. The previous-year candidate always lies before stop, so if it is rejected it must be earlier than start, and every older year is earlier still. If both checks fail, no timestamp fulfilling your requirements can exist.

If the day must be kept, just remove day=1 from the replace call. If you need the bounds to be exclusive, change the inequalities accordingly. The following function:

def y(d):
    if start < d.replace(year=stop.year) < stop:
        return stop.year
    if start < d.replace(year=stop.year-1) < stop:
        return stop.year-1
    raise TypeError("Ooops")

Returns the same dataframe as yours.
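
Since the result depends only on the month of each index entry, another option is to compute the answer once per calendar month and then look it up for the whole index at once. The sketch below is an assumption-laden illustration rather than a tested drop-in: latest_full_month_year is a hypothetical name, and it follows the criterion from the question (the whole month, from its first to its last day, must lie inside the range).

```python
import numpy as np
import pandas as pd
from pandas.tseries.offsets import MonthEnd


def latest_full_month_year(start, stop, index):
    """Latest year whose calendar month lies fully inside [start, stop]."""
    # lookup[m] holds the answer for month m (1..12); 0 means "not found".
    lookup = np.zeros(13, dtype=np.int64)
    for month in range(1, 13):
        for year in range(stop.year, start.year - 1, -1):
            month_start = pd.Timestamp(year=year, month=month, day=1)
            month_end = month_start + MonthEnd(1)  # last day of that month
            if start <= month_start and month_end <= stop:
                lookup[month] = year
                break

    # One vectorized lookup instead of one Python call per row.
    years = lookup[index.month.to_numpy()]
    if (years == 0).any():
        raise ValueError("some months never fall fully inside the range")
    return pd.Series(years, index=index, name="year")


index = pd.date_range("01-01-2019", "01-01-2020")
start = pd.Timestamp("2016-01-01")
stop = pd.Timestamp("2018-01-01")
years = latest_full_month_year(start, stop, index)
print(years.value_counts())
```

The inner loops run at most twelve times the number of years in the range, regardless of how many rows the dataframe has, so the per-row cost is reduced to an array lookup.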