
Pandas: Calculate average of values for a time frame

I am working on a large dataset that looks like this:

Time,   Value
01.01.2018 00:00:00.000,  5.1398
01.01.2018 00:01:00.000,  5.1298
01.01.2018 00:02:00.000,  5.1438
01.01.2018 00:03:00.000,  5.1228
01.01.2018 00:04:00.000,  5.1168
.... , ,,,,
31.12.2018 23:59:59.000,  6.3498

The data is sampled once per minute, from the first day of the year to the last day of the year.

I want to use Pandas to compute, for each day, the average over a 5-day window.

For example:

The average from 01.01.2018 00:00:00.000 to 05.01.2018 23:59:59.000 is the average for 05.01.2018.

The next average, from 02.01.2018 00:00:00.000 to 06.01.2018 23:59:59.000, is the average for 06.01.2018.

The next average, from 03.01.2018 00:00:00.000 to 07.01.2018 23:59:59.000, is the average for 07.01.2018.

And so on. The window advances by one day at a time, and each average covers 5 days: the current day and the 4 days before it.

For a given day, there are 24 hours * 60 minutes = 1440 data points, so each average covers 1440 data points * 5 days = 7200 data points.

The final DataFrame should look like this, with the time formatted as DD.MM.YYYY (without hh:mm:ss) and Value holding the 5-day average that includes the current date:

Time,   Value
05.01.2018,  5.1398
06.01.2018,  5.1298
07.01.2018,  5.1438
.... , ,,,,
31.12.2018,  6.3498

In short, for each day I need the average of all data from that day and the 4 days before it, presented as shown above.

I tried iterating with a plain Python loop, but I would like a better approach that uses Pandas.

Perhaps this will work?

import numpy as np
import pandas as pd

# Create one year of random data spaced evenly in 1 minute intervals.
np.random.seed(0)  # So that others can reproduce the same result given the random numbers.
time_idx = pd.date_range(start='2018-01-01', end='2018-12-31', freq='min')
df = pd.DataFrame({'Time': time_idx, 'Value': abs(np.random.randn(len(time_idx))) + 5})

>>> df.shape
(524161, 2)
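The row count above follows from where the range ends: end='2018-12-31' stops at midnight on 31 December, so the index holds 364 full days plus one timestamp. A small check, with the hypothetical name full_year standing for an index that also covers the last day (the '23:59' end point is an assumption about what a complete year of minute data would look like):

```python
import pandas as pd

# Same range as in the answer: ends at 2018-12-31 00:00, not at the end of that day.
time_idx = pd.date_range(start='2018-01-01', end='2018-12-31', freq='min')
n_rows = len(time_idx)  # 364 days * 1440 minutes + 1 final timestamp

# Hypothetical variant that also covers all of 31 December.
full_year = pd.date_range(start='2018-01-01', end='2018-12-31 23:59', freq='min')
n_full = len(full_year)  # 365 days * 1440 minutes
```

With the original range, the value reported for 2018-12-31 is taken from the single midnight sample of that day.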

Given the dataframe with 1-minute intervals, you can take a rolling average over the past five days (5 days * 24 hours/day * 60 minutes/hour = 7200 minutes) and assign the result to a new column named rolling_5d_avg. You can then group on the date of the original timestamps, obtained through the dt accessor, and take the last rolling_5d_avg value for each date. Dates before the first full 7200-row window come out as NaN.

df = (
    df
    .assign(rolling_5d_avg=df['Value'].rolling(window=5*24*60).mean())
    .groupby(df['Time'].dt.date)['rolling_5d_avg']
    .last()
)

>>> df.head(10)
Time
2018-01-01         NaN
2018-01-02         NaN
2018-01-03         NaN
2018-01-04         NaN
2018-01-05    5.786603
2018-01-06    5.784011
2018-01-07    5.790133
2018-01-08    5.786967
2018-01-09    5.789944
2018-01-10    5.789299
Name: rolling_5d_avg, dtype: float64
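The fixed window of 7200 rows spans exactly five days only when no minute is missing. As a hedged alternative, the sketch below uses a time-based window ('5D') on a DatetimeIndex, parses the question's DD.MM.YYYY timestamps explicitly, and formats the result the way the question asks. The 10-day sample and the names raw, s, daily and result are made up for illustration. Time-based windows default to min_periods=1, so the first four days are dropped by position rather than appearing as NaN.

```python
import numpy as np
import pandas as pd

# Hypothetical sample in the question's layout: 10 days of 1-minute data.
idx = pd.date_range('2018-01-01', '2018-01-10 23:59', freq='min')
raw = pd.DataFrame({
    'Time': idx.strftime('%d.%m.%Y %H:%M:%S.%f').str[:-3],  # e.g. 01.01.2018 00:00:00.000
    'Value': np.arange(len(idx), dtype=float),
})

# Parse the day-first timestamps with an explicit format and index by time.
raw['Time'] = pd.to_datetime(raw['Time'], format='%d.%m.%Y %H:%M:%S.%f')
s = raw.set_index('Time')['Value']

# Time-based window: each row averages all rows in the 5 days ending at that row.
rolling = s.rolling('5D').mean()

# Keep the last value of each calendar day, then drop the first 4 days,
# whose windows do not yet span 5 full days.
daily = rolling.groupby(rolling.index.normalize()).last().iloc[4:]

# Format the dates as DD.MM.YYYY, as requested in the question.
daily.index = daily.index.strftime('%d.%m.%Y')
result = daily.rename_axis('Time').reset_index()
```

Here result has one row per day starting at 05.01.2018, with columns Time and Value.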
