Selecting dates in Pandas DataFrame to calculate daylight savings time

I'm trying to select a range of dates in a Pandas DataFrame (containing half-hourly data) to determine which days fall within daylight saving time (DST). DST starts on the last Sunday of September and ends on the first Sunday of April.

import numpy as np
import pandas as pd
from datetime import datetime, date, timedelta

...

df0 = df0.set_index('datetime')

df0['mnth'] = pd.DatetimeIndex(df0.index).month
df0['dow'] = pd.DatetimeIndex(df0.index).dayofweek # Mon=0, ..., Sun=6

start_dst = df0.iloc[(df0.mnth==9) & (df0.dow==6).idxmax()]
end_dst = df0.iloc[(df0.mnth==4) & (df0.dow==6).idxmin()]
df0.index[start_dst:end_dst] = df0.index + pd.Timedelta('1h')

My data is essentially shifted 1 hour backwards in the Sep-Apr period, so I need to add 1h to the timestamps in this period. But when I define start_dst, I get an error:

TypeError: Cannot perform 'and_' with a dtyped [bool] array and scalar of type [bool]

I'm not sure how to change start_dst.

Edit: Here is a sample dataframe:

# End DST: first Sunday of April, 1h backward (5 Apr 2020)
# Start DST: last Sunday of September, 1h forward (27 Sep 2020)
# 4,5,6 April 2020, 26,27,28 Sep 2020
d1 = '2020-04-04'
d2 = '2020-04-05'
d3 = '2020-04-06'
d4 = '2020-09-26'
d5 = '2020-09-27'
d6 = '2020-09-28'

df1 = pd.DataFrame()
df1['date'] = pd.to_datetime([d1]*24, format='%Y-%m-%d')
df1['time'] = (pd.date_range(d1, periods=24, freq='H') - pd.Timedelta(hours=1)).time
df1 = df1.set_index('date')

df2 = pd.DataFrame()
df2['date'] = pd.to_datetime([d2]*25, format='%Y-%m-%d')
df2['time'] = (pd.date_range(d2, periods=25, freq='H') - pd.Timedelta(hours=1)).time
df2 = df2.set_index('date')

df3 = pd.DataFrame()
df3['date'] = pd.to_datetime([d3]*24, format='%Y-%m-%d')
df3['time'] = (pd.date_range(d3, periods=24, freq='H')).time
df3 = df3.set_index('date')

df4 = pd.DataFrame()
df4['date'] = pd.to_datetime([d4]*24, format='%Y-%m-%d')
df4['time'] = (pd.date_range(d4, periods=24, freq='H')).time
df4 = df4.set_index('date')

df5 = pd.DataFrame()
df5['date'] = pd.to_datetime([d5]*23, format='%Y-%m-%d')
df5a = pd.DataFrame(pd.date_range('00:00', '01:59', freq='H').time)
df5b = pd.DataFrame(pd.date_range('01:00', '01:59', freq='H').time)
df5c = pd.DataFrame(pd.date_range('03:00', '22:00', freq='H').time)
df5['time'] = pd.concat([df5a,df5b,df5c],axis=0).values
df5 = df5.set_index('date')

df6 = pd.DataFrame()
df6['date'] = pd.to_datetime([d6]*24, format='%Y-%m-%d')
df6['time'] = (pd.date_range(d6, periods=24, freq='H') - pd.Timedelta(hours=1)).time
df6 = df6.set_index('date')

df0 = pd.DataFrame()
# DataFrame.append was removed in pandas 2.0; pd.concat gives the same result
z = pd.concat([df1, df2, df3, df4, df5, df6])
df0['datetime'] = pd.to_datetime(z.index.astype(str)+' '+z.time.astype(str),
                            format='%Y-%m-%d %H:%M:%S')
df0 = df0.set_index('datetime')

df0['mnth'] = pd.DatetimeIndex(df0.index).month
df0['dow'] = pd.DatetimeIndex(df0.index).dayofweek # Mon=0, ..., Sun=6
df0['hour'] = pd.DatetimeIndex(df0.index).hour

You can define a function that returns the index values by evaluating the condition:

def get_indexex():
    try:
        # Boolean arrays: True where the index equals the label returned by idxmax()/idxmin()
        idxmx = df0.index == (df0['dow'] == 6).idxmax()
        idxmn = df0.index == (df0['dow'] == 6).idxmin()
        start_dst = df0.loc[(df0['mnth'] == 9) & idxmx]
        end_dst = df0.loc[(df0['mnth'] == 4) & idxmn]
        if not start_dst.index.tolist():
            return df0.loc[:end_dst.index[-1]].index
        elif not end_dst.index.tolist():
            return df0.loc[start_dst.index[0]:].index
        else:
            return df0.loc[start_dst.index[0]:end_dst.index[-1]].index
    except IndexError:
        # Fallback: select the boundary rows by month, weekday and hour instead
        start_dst = df0.loc[(df0['dow'].eq(6) & df0['mnth'].eq(9)) & df0['hour'].eq(2)]
        end_dst = df0.loc[df0['mnth'].eq(4) & df0['hour'].eq(3)]
        if not start_dst.index.tolist():
            return df0.loc[:end_dst.index[-1]].index
        elif not end_dst.index.tolist():
            return df0.loc[start_dst.index[0]:].index
        else:
            return df0.loc[start_dst.index[0]:end_dst.index[-1]].index

Finally:

df0['dt'] = df0.index
m = df0.index.isin(get_indexex())
df0.loc[m, 'dt'] = df0.loc[m, 'dt'] + pd.Timedelta('1h')
df0.index = df0.pop('dt')

Some explanations:

  • You can't modify a subset of the index in place, so we create a 'dt' column, set it equal to the index of the dataframe, change the values there, and assign it back as the index.

  • The idxmx and idxmn variables compare the results of idxmax() and idxmin() with the index of the dataframe, which gives a boolean array. You get the error because (df0.dow==6).idxmax() or (df0.dow==6).idxmin() returns a single value, not a Series of boolean values.

  • The function get_indexex() returns the index values where the condition is satisfied, and it also handles the situation where start_dst is an empty dataframe.

  • Also note that inside the function we slice from the 0th index of start_dst to the last index of end_dst, to cover the cases where start_dst and end_dst contain multiple entries.
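The first point can be checked with a small example. This is only a sketch on made-up hourly data (the names df, mask and slice_assignment_failed are illustrative, not from the post): writing into a slice of an index raises a TypeError, while building a complete new index from a boolean mask works.

```python
import pandas as pd

# Hypothetical hourly data, not the asker's frame
idx = pd.date_range('2020-09-27 00:00', periods=6, freq='h')
df = pd.DataFrame({'value': range(6)}, index=idx)

# An Index is immutable, so assigning to part of it fails
try:
    df.index[2:4] = df.index[2:4] + pd.Timedelta(hours=1)
    slice_assignment_failed = False
except TypeError:
    slice_assignment_failed = True

# Instead, build the full new index: keep rows where the mask is False,
# and add one hour to the rows where it is True
mask = df.index.hour >= 2
df.index = df.index.where(~mask, df.index + pd.Timedelta(hours=1))
```

The 'dt' column used in the answer achieves the same thing; Index.where is just one alternative way to build the replacement index in a single step.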

I believe the error is caused by idxmax() and idxmin(). Both return an index label, which isn't a boolean type. (df0.mnth==9) and (df0.mnth==4) return arrays of True and False, and when you try to combine such an array with that label using &, this error occurs.
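A small sketch of the difference, on made-up half-hourly data for 2020 (the frame df and the variable names here are illustrative, not the asker's): idxmax() returns one index label, whereas & needs boolean values on both sides. Combining the two element-wise conditions first, and only then reducing to a single timestamp, avoids the error.

```python
import pandas as pd

# Hypothetical half-hourly index covering 2020
idx = pd.date_range('2020-01-01', '2020-12-31 23:30', freq='30min')
df = pd.DataFrame(index=idx)
df['mnth'] = df.index.month
df['dow'] = df.index.dayofweek  # Mon=0, ..., Sun=6

# idxmax() gives a single label (the first Sunday in the data), not a mask
first_sunday = (df['dow'] == 6).idxmax()

# Combine the boolean Series first, then reduce to one date
sept_sundays = df.loc[(df['mnth'] == 9) & (df['dow'] == 6)].index
april_sundays = df.loc[(df['mnth'] == 4) & (df['dow'] == 6)].index
start_dst = sept_sundays.max().normalize()  # last Sunday of September
end_dst = april_sundays.min().normalize()   # first Sunday of April
```

For 2020 this gives 27 September and 5 April, matching the dates in the question's sample data.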

The thought of dealing with DST manually gives me a headache. Pandas Timestamp objects (single values of a Series) have a dst() method, which returns the daylight saving time offset.
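A minimal sketch of that idea. It assumes the data belongs to the Pacific/Auckland zone, which is only a guess based on the DST rule described in the question (last Sunday of September to first Sunday of April); the post never names a time zone. dst() needs a timezone-aware timestamp and returns the DST offset in effect at that moment.

```python
from datetime import timedelta

import pandas as pd

# Assumption: 'Pacific/Auckland' is a guess, the post does not name a zone
tz = 'Pacific/Auckland'

summer = pd.Timestamp('2020-01-15 12:00', tz=tz)  # inside the Sep-Apr DST period
winter = pd.Timestamp('2020-07-15 12:00', tz=tz)  # outside it

summer_offset = summer.dst()  # one hour
winter_offset = winter.dst()  # zero

# The same check for each timestamp of a tz-aware index (noon on 3-6 April 2020)
idx = pd.date_range('2020-04-03 12:00', periods=4, freq='D', tz=tz)
in_dst = [ts.dst() != timedelta(0) for ts in idx]
```

With this zone, the flags switch from True to False between 4 and 5 April 2020, the end-of-DST date given in the question, so no manual search for the boundary Sundays is needed.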
