Python pandas DatetimeIndex: What is the most Pythonic strategy to analyse rolling with steps? (e.g. certain hours for each day)

I am working on a data frame with a DatetimeIndex of hourly temperature data spanning a couple of years. I want to add a column with the minimum temperature between 20:00 of a day and 8:00 of the following day. Daytime temperatures, from 8:00 to 20:00, are not of interest. The result can either be at the same hourly resolution as the original data or be resampled to days.

I have researched a number of strategies to solve this, but I am unsure which is the most efficient (primarily in terms of coding effort, secondarily in terms of computing time) or the most Pythonic way to do it. Some of the possibilities I have come up with:

  1. Attach a column with labels 'day', 'night' depending on df.index.hour and use groupby or df.loc to find the minimum
  2. Resample to 12h and drop every second value. Not sure how I can make the resampling period start at 20:00.
  3. Add a multi-index - I guess this is similar to approach 1, but feels a bit over the top for what I'm trying to achieve.
  4. Use df.between_time ( https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.between_time.html#pandas.DataFrame.between_time ) though I'm not sure if the date change over midnight will make this a bit messy.
  5. Lastly there is some discussion about combining rolling with a stepping parameter as new pandas feature: https://github.com/pandas-dev/pandas/issues/15354
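
To make options 1 and 4 more concrete, here is a rough sketch of how they might be combined. It runs on synthetic hourly data rather than my real data set, and the names night, night_date and nightly_min are made up for the illustration. As far as I can tell, between_time wraps past midnight when the start time is later than the end time, so the date change only matters when labelling each night:

```python
import numpy as np
import pandas as pd

# Synthetic hourly data for illustration only (temp simply counts the hours).
idx = pd.date_range("2009-07-01 00:00", periods=72, freq="h")
df = pd.DataFrame({"temp": np.arange(72, dtype=float)}, index=idx)

# Keep only the night-time rows; both 20:00 and 08:00 are included by default.
night = df.between_time("20:00", "08:00")

# Label every row with the date on which its night began. Shifting back by
# 12 hours maps 20:00 -> 08:00 of the same date and 08:00 (next day) -> 20:00
# of the starting date, so normalize() returns the starting date for both.
night_date = (night.index - pd.Timedelta(hours=12)).normalize()

# One minimum per night, indexed by the date the night started.
nightly_min = night["temp"].groupby(night_date).min()
print(nightly_min)
```

This gives the daily-resolution variant of the result; the 08:00 reading is counted as part of the night here, which may or may not be what is wanted.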

Original df looks like this:

datetime                 temp
2009-07-01 01:00:00      17.16
2009-07-01 02:00:00      16.64
2009-07-01 03:00:00      16.21  #<-- minimum for the night 2009-06-30 (previous date, since the period starts at 2009-06-30 20:00)
...                        ...
2019-06-24 22:00:00      14.03  #<-- minimum for the night 2019-06-24
2019-06-24 23:00:00      18.87
2019-06-25 00:00:00      17.85
2019-06-25 01:00:00      17.25

I want to get something like this (min temp from day 20:00 to day+1 8:00):

datetime                 temp
2009-06-30 23:00:00      16.21
2009-07-01 00:00:00      16.21
2009-07-01 01:00:00      16.21
2009-07-01 02:00:00      16.21
2009-07-01 03:00:00      16.21
...                        ...
2019-06-24 22:00:00      14.03
2019-06-24 23:00:00      14.03
2019-06-25 00:00:00      14.03
2019-06-25 01:00:00      14.03

or a bit more succinct:

datetime    temp
2009-06-30  16.21
...           ...
2019-06-24  14.03

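Use the offset option of resample so that the 12-hour bins start at 08:00 and 20:00. (Before pandas 1.1 this option was called base, as in base=8; base was removed in pandas 2.0.)

rs = df.resample('12h', offset='8h').min()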

Then keep only the rows for 20:00:

rs[rs.index.hour == 20]
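
A self-contained sketch of the two steps above, assuming pandas 1.1 or later (for the offset argument) and using synthetic hourly data in place of the asker's data frame. The variable nightly and the final relabelling step are additions for illustration: they turn the 20:00 bin labels into plain dates, which matches the more succinct output asked for.

```python
import numpy as np
import pandas as pd

# Synthetic hourly data for illustration only (temp simply counts the hours).
idx = pd.date_range("2009-07-01 00:00", periods=72, freq="h")
df = pd.DataFrame({"temp": np.arange(72, dtype=float)}, index=idx)

# 12-hour bins whose edges fall at 08:00 and 20:00.
rs = df.resample("12h", offset="8h").min()

# Bins labelled 20:00 cover 20:00 up to (but not including) 08:00 next day.
nightly = rs.loc[rs.index.hour == 20, "temp"]

# Relabel each night with the date on which it started.
nightly.index = nightly.index.normalize()
print(nightly)
```

Note that the bins are closed on the left by default, so a reading taken exactly at 08:00 counts towards the daytime bin.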

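You can use pd.Grouper (the replacement for the old TimeGrouper) with freq='12h' and the bin edges shifted by 8 hours (base=8 in older pandas, offset='8h' from pandas 1.1 on). This chunks the data frame into 12-hour blocks running 08:00 - 20:00 and 20:00 - (+day)08:00.

Then you can just call .min() on the groups; the blocks that start at 20:00 are the nights.

Try this: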

import pandas as pd
from io import StringIO

s = """
datetime                 temp
2009-07-01 01:00:00      17.16
2009-07-01 02:00:00      16.64
2009-07-01 03:00:00      16.21
2019-06-24 22:00:00      14.03
2019-06-24 23:00:00      18.87
2019-06-25 00:00:00      17.85
2019-06-25 01:00:00      17.25"""

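# Columns are separated by two or more spaces; a regex separator needs the python engine.
df = pd.read_csv(StringIO(s), sep=r"\s\s+", engine="python")
df['datetime'] = pd.to_datetime(df['datetime'])

# 12-hour bins with edges at 08:00 and 20:00 (offset replaces base, which was removed in pandas 2.0).
grouper = pd.Grouper(freq='12h', offset='8h', key='datetime')
result = df.sort_values('datetime').groupby(grouper)['temp'].min().dropna()
# Keep only the night bins, i.e. those that start at 20:00.
result = result[result.index.hour == 20]
print(result)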

Output:

datetime
2009-06-30 20:00:00    16.21
2019-06-24 20:00:00    14.03
Name: temp, dtype: float64
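
If the result is wanted at the original hourly resolution instead, one possible approach is to compute the start of the 12-hour block for every row and broadcast the block minimum back with transform. This is only a sketch on synthetic data; block_start, block_min and the night_min column are names invented for the example.

```python
import numpy as np
import pandas as pd

# Synthetic hourly data for illustration only (temp simply counts the hours).
idx = pd.date_range("2009-07-01 00:00", periods=72, freq="h")
df = pd.DataFrame({"temp": np.arange(72, dtype=float)}, index=idx)

# Start (08:00 or 20:00) of the 12-hour block that contains each timestamp:
# shift back 8 hours, floor to 12 hours, then shift forward again.
shift = pd.Timedelta(hours=8)
block_start = (df.index - shift).floor("12h") + shift

# Minimum of each block, repeated on every row of that block.
block_min = df["temp"].groupby(block_start).transform("min")

# Keep the value for night blocks only; daytime rows become NaN.
df["night_min"] = block_min.where(block_start.hour == 20)
print(df.head(12))
```

Daytime rows are left as NaN here; they could just as well be dropped with dropna() if only the night hours are of interest.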
