
Rolling Mean with Time Offset Pandas

I have a data set of timestamps and values in pandas. The interval between timestamps is ~12 seconds over a total timespan of roughly one year, but some points are missing (i.e., the time series is irregular, so I can't use fixed window sizes).

I want to compute the rolling averages of the values over 1-minute intervals, but I'm not getting the behavior I expected. I found a similar issue here, but that one used the sum and was also pre-pandas 0.19.0. I am using pandas 0.23.4.

Sample Data

Time, X
2018-02-02 21:27:00,    75.4356
2018-02-02 21:27:12,    78.29821
2018-02-02 21:27:24,    73.098345
2018-02-02 21:27:36,    78.3331
2018-02-02 21:28:00,    79.111

Note that 2018-02-02 21:27:48 is missing.

For a rolling sum, I could just fill the missing values with 0s, but for the mean I don't want the missing points to be counted as real data points (i.e., I want each window's mean to be sum(data points in the interval) / number of data points in the interval).

I'd like to do this for varying lengths of time (e.g., 1 min, 5 min, 15 min, 1 hr, etc.).

The closest I got to getting actual values was to do:

Code

df['rolling_avg']=df.rolling('1T',on='Time').X.mean()

My understanding is that would be the 1 minute rolling averages.

But then, I'm not sure how to interpret the output. I would have expected NaNs for the first 1+1 minute since there is nothing to base the rolled average on but instead I have values.

Output

    Time                X         rolling_avg
0   2018-02-02 21:27:00 75.4356   75.435600
1   2018-02-02 21:27:12 78.29821  76.866905
2   2018-02-02 21:27:24 73.098345 75.610718
3   2018-02-02 21:27:36 78.3331   76.291314
4   2018-02-02 21:28:00 79.111    77.210164

Basically, in this output, rolling_avg in row 1 is (X[0] + X[1]) / 2, even though those two points span only 12 seconds, not 1 minute.

Is there a way to do what I am trying to do or do I need to write a for-loop to do this manually?

I think the problem might be in your data, though I may not be solving your actual problem. I saw the same behavior using your data, but it worked when I tried this:

import pandas as pd
import numpy as np

time = pd.date_range(start='1/1/2018', end='1/02/2018', freq='12s')
time

DatetimeIndex(['2018-01-01 00:00:00', '2018-01-01 00:00:12',
               '2018-01-01 00:00:24', '2018-01-01 00:00:36',
               '2018-01-01 00:00:48', '2018-01-01 00:01:00',
               '2018-01-01 00:01:12', '2018-01-01 00:01:24',
               '2018-01-01 00:01:36', '2018-01-01 00:01:48',
               ...
               '2018-01-01 23:58:12', '2018-01-01 23:58:24',
               '2018-01-01 23:58:36', '2018-01-01 23:58:48',
               '2018-01-01 23:59:00', '2018-01-01 23:59:12',
               '2018-01-01 23:59:24', '2018-01-01 23:59:36',
               '2018-01-01 23:59:48', '2018-01-02 00:00:00'],
              dtype='datetime64[ns]', length=7201, freq='12S')

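# One random integer in [0, 9) per timestamp
B = np.random.randint(0, 9, len(time))

# Use the timestamps as the index so a time-based window can be applied
df = pd.DataFrame(B, time)
# 60-second window; emit NaN until the window holds at least 3 observations
df['rolling_avg'] = df[0].rolling('60s', min_periods=3).mean()
df.head(20)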

                    0   rolling_avg
2018-01-01 00:00:00 5   NaN
2018-01-01 00:00:12 0   NaN
2018-01-01 00:00:24 1   2.0
2018-01-01 00:00:36 0   1.5
2018-01-01 00:00:48 6   2.4
2018-01-01 00:01:00 7   2.8
2018-01-01 00:01:12 6   4.0
2018-01-01 00:01:24 3   4.4
2018-01-01 00:01:36 7   5.8
2018-01-01 00:01:48 6   5.8
2018-01-01 00:02:00 2   4.8
2018-01-01 00:02:12 6   4.8
2018-01-01 00:02:24 1   4.4
2018-01-01 00:02:36 0   3.0
2018-01-01 00:02:48 8   3.4
2018-01-01 00:03:00 2   3.4
2018-01-01 00:03:12 5   3.2
2018-01-01 00:03:24 8   4.6
2018-01-01 00:03:36 4   5.4
2018-01-01 00:03:48 1   4.0
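
To see what min_periods does on the irregular data from the question, here is a minimal sketch that applies the same call to the five sample rows. It assumes a reasonably recent pandas (the "60s" offset string and the on= parameter). Note that min_periods counts observations inside the window, not elapsed time, so the NaNs end at the third row, after only 24 seconds.

```python
import pandas as pd

# Sample data from the question (the 21:27:48 point is missing).
df = pd.DataFrame({
    "Time": pd.to_datetime([
        "2018-02-02 21:27:00",
        "2018-02-02 21:27:12",
        "2018-02-02 21:27:24",
        "2018-02-02 21:27:36",
        "2018-02-02 21:28:00",
    ]),
    "X": [75.4356, 78.29821, 73.098345, 78.3331, 79.111],
})

# min_periods=3 requires at least 3 observations in the 60-second window;
# it says nothing about how much time the window actually covers.
df["rolling_avg"] = df.rolling("60s", on="Time", min_periods=3)["X"].mean()
print(df)
```

The first two rows come out as NaN and the remaining rows match the values in the question's output, so this changes when results start appearing but not how each mean is computed.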

You say: "But then, I'm not sure how to interpret the output. I would have expected NaNs for the first 1+1 minute since there is nothing to base the rolled average on but instead I have values."

The method .rolling() takes into account every value whose index falls inside a 1-minute interval. By default, that interval is open on the left and closed on the right (you can change this with the optional parameter closed), and its right end is the current index (in recent pandas versions you can change this too, with the optional parameter center).
In your case, the first such interval is (2018-02-02 21:26:00, 2018-02-02 21:27:00], which contains only the index 2018-02-02 21:27:00. Therefore the mean is computed over only one value. Likewise, the last interval is (2018-02-02 21:27:00, 2018-02-02 21:28:00], which excludes 21:27:00 and so averages the four remaining values.
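
The window rule described above can be checked by hand. The following is only an illustrative sketch on the question's sample data: manual_mean is a hypothetical helper (not part of pandas) that selects the rows in (t - 1 minute, t] for each timestamp t, and the result is compared with what .rolling() returns under its default settings.

```python
import pandas as pd

# Sample data from the question (the 21:27:48 point is missing).
df = pd.DataFrame({
    "Time": pd.to_datetime([
        "2018-02-02 21:27:00",
        "2018-02-02 21:27:12",
        "2018-02-02 21:27:24",
        "2018-02-02 21:27:36",
        "2018-02-02 21:28:00",
    ]),
    "X": [75.4356, 78.29821, 73.098345, 78.3331, 79.111],
})

window = pd.Timedelta("1min")

def manual_mean(t):
    # Hypothetical helper: mean of the rows whose timestamp lies in (t - window, t].
    in_window = (df["Time"] > t - window) & (df["Time"] <= t)
    return df.loc[in_window, "X"].mean()

df["manual_avg"] = df["Time"].apply(manual_mean)

# Default time-based rolling window: open on the left, closed on the right.
rolling = df.rolling("1min", on="Time")["X"]
df["rolling_avg"] = rolling.mean()
df["n_points"] = rolling.count()  # how many observations each mean is based on

print(df)
```

The n_points column makes the behavior visible: the means are based on 1, 2, 3, 4 and 4 observations, so the missing point is simply absent from the window rather than counted as a zero.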

I hope this makes sense to you.
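
If you do want NaNs until a full window of time has elapsed, one possible approach is to mask the early rows yourself. The sketch below is an assumption about the intended behavior, not a built-in pandas option: rolling_mean_full_window is a hypothetical helper that keeps a mean only once the time since the first sample is at least the window length. It is applied to several window lengths, as asked in the question.

```python
import pandas as pd

# Sample data from the question, indexed by timestamp.
s = pd.Series(
    [75.4356, 78.29821, 73.098345, 78.3331, 79.111],
    index=pd.to_datetime([
        "2018-02-02 21:27:00",
        "2018-02-02 21:27:12",
        "2018-02-02 21:27:24",
        "2018-02-02 21:27:36",
        "2018-02-02 21:28:00",
    ]),
    name="X",
)

def rolling_mean_full_window(series, window):
    """Hypothetical helper: time-based rolling mean that is NaN until
    `window` has elapsed since the first sample (an assumed rule)."""
    means = series.rolling(window).mean()
    elapsed = series.index - series.index[0]
    # Keep a mean only where the series already spans a full window.
    return means.where(elapsed >= pd.Timedelta(window))

out = pd.DataFrame({"X": s})
for w in ["1min", "5min", "15min", "1h"]:
    out[f"avg_{w}"] = rolling_mean_full_window(s, w)

print(out)
```

On the five sample rows, only the last row has a full minute of history, so avg_1min is NaN everywhere except at 21:28:00, and the longer windows are NaN throughout. Missing points still do not distort the mean, because each window averages only the observations actually present.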
