Rolling Mean with Time Offset Pandas
I have a data set of timestamps and values in pandas. The interval between timestamps is ~12 seconds over a total timespan of roughly one year, but sometimes points are missing (i.e., the time series is irregular, so I can't use fixed window sizes).
I want to compute rolling averages of the values over 1-minute intervals, but I'm not getting the behavior I expected. I found a similar issue here, but that used the sum and predates pandas 0.19.0. I am using pandas 0.23.4.
Sample Data
Time, X
2018-02-02 21:27:00, 75.4356
2018-02-02 21:27:12, 78.29821
2018-02-02 21:27:24, 73.098345
2018-02-02 21:27:36, 78.3331
2018-02-02 21:28:00, 79.111
Note that 2018-02-02 21:27:48 is missing.
For a rolling sum, I could just fill the missing values with 0s, but for the mean I don't want the missing points counted as real data points (that is, I want each window to be sum(data points over given interval) / number of data points in interval).
I'd like to do this for varying segments of time (e.g., 1min, 5min, 15min, 1hr, etc).
The closest I got to getting actual values was to do:
Code
df['rolling_avg']=df.rolling('1T',on='Time').X.mean()
My understanding is that this would give the 1-minute rolling averages.
But then, I'm not sure how to interpret the output. I would have expected NaNs for the first 1+1 minute since there is nothing to base the rolled average on, but instead I have values.
Output
Time X rolling_avg
0 2018-02-02 21:27:00 75.4356 75.435600
1 2018-02-02 21:27:12 78.29821 76.866905
2 2018-02-02 21:27:24 73.098345 75.610718
3 2018-02-02 21:27:36 78.3331 76.291314
4 2018-02-02 21:28:00 79.111 77.210164
Basically, in this output, df[1].rolling_avg is (Value[0]+Value[1])/2, though the interval was 12 seconds, not 1 minute.
Is there a way to do what I am trying to do, or do I need to write a for-loop to do this manually?
I think the problem might be in your data, though I may not be solving your actual problem. I got the same error using your data, but it worked when I tried this.
import pandas as pd
import numpy as np
import datetime
time = pd.date_range(start='1/1/2018', end='1/02/2018', freq='12s')
time
DatetimeIndex(['2018-01-01 00:00:00', '2018-01-01 00:00:12',
'2018-01-01 00:00:24', '2018-01-01 00:00:36',
'2018-01-01 00:00:48', '2018-01-01 00:01:00',
'2018-01-01 00:01:12', '2018-01-01 00:01:24',
'2018-01-01 00:01:36', '2018-01-01 00:01:48',
...
'2018-01-01 23:58:12', '2018-01-01 23:58:24',
'2018-01-01 23:58:36', '2018-01-01 23:58:48',
'2018-01-01 23:59:00', '2018-01-01 23:59:12',
'2018-01-01 23:59:24', '2018-01-01 23:59:36',
'2018-01-01 23:59:48', '2018-01-02 00:00:00'],
dtype='datetime64[ns]', length=7201, freq='12S')
B = np.random.randint(0, 9, 7201)
df = pd.DataFrame(B, time)
df['rolling_avg']=df.rolling('60s', min_periods=3).mean()
df.head(20)
0 rolling_avg
2018-01-01 00:00:00 5 NaN
2018-01-01 00:00:12 0 NaN
2018-01-01 00:00:24 1 2.0
2018-01-01 00:00:36 0 1.5
2018-01-01 00:00:48 6 2.4
2018-01-01 00:01:00 7 2.8
2018-01-01 00:01:12 6 4.0
2018-01-01 00:01:24 3 4.4
2018-01-01 00:01:36 7 5.8
2018-01-01 00:01:48 6 5.8
2018-01-01 00:02:00 2 4.8
2018-01-01 00:02:12 6 4.8
2018-01-01 00:02:24 1 4.4
2018-01-01 00:02:36 0 3.0
2018-01-01 00:02:48 8 3.4
2018-01-01 00:03:00 2 3.4
2018-01-01 00:03:12 5 3.2
2018-01-01 00:03:24 8 4.6
2018-01-01 00:03:36 4 5.4
2018-01-01 00:03:48 1 4.0
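One caveat about min_periods, sketched here on the sample data from the question: it counts observations inside the window, not elapsed time. On an irregular series it therefore does not guarantee that a full minute of data sits behind each value. The column name rolling_avg follows the question, and min_periods=3 is simply the value used above, not a recommendation.

```python
import pandas as pd

# Sample data from the question (21:27:48 is missing).
df = pd.DataFrame({
    "Time": pd.to_datetime([
        "2018-02-02 21:27:00",
        "2018-02-02 21:27:12",
        "2018-02-02 21:27:24",
        "2018-02-02 21:27:36",
        "2018-02-02 21:28:00",
    ]),
    "X": [75.4356, 78.29821, 73.098345, 78.3331, 79.111],
})

# min_periods=3 only requires three observations in the trailing 60 s window,
# so the third row already gets a value although only 24 s have elapsed.
df["rolling_avg"] = df.rolling("60s", on="Time", min_periods=3)["X"].mean()
print(df)
```

With this data the first two rows come out as NaN and the third row is the mean of the first three values.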
You say: But then, I'm not sure how to interpret the output. I would have expected NaNs for the first 1+1 minute since there is nothing to base the rolled average on, but instead I have values.
The method .rolling() takes into account all values whose index lies within a 1-minute interval. The interval is (by default, but you can change this; use the optional parameter closed) open to the left and closed to the right. Its right end is the current index (you can change this, too; use the optional parameter center).
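A small sketch of the effect of closed, using the sample data from the question. It counts how many points fall into the trailing 60-second window ending at each timestamp; the variable names are only illustrative.

```python
import pandas as pd

# Sample data from the question, indexed by time.
s = pd.Series(
    [75.4356, 78.29821, 73.098345, 78.3331, 79.111],
    index=pd.to_datetime([
        "2018-02-02 21:27:00",
        "2018-02-02 21:27:12",
        "2018-02-02 21:27:24",
        "2018-02-02 21:27:36",
        "2018-02-02 21:28:00",
    ]),
)

# Default (closed="right"): window is (t - 60s, t], so for t = 21:28:00
# the point at 21:27:00 is excluded.
n_right = s.rolling("60s").count()

# closed="both": window is [t - 60s, t], so 21:27:00 is included.
n_both = s.rolling("60s", closed="both").count()

print(n_right.iloc[-1], n_both.iloc[-1])
```

For the last timestamp the default window holds 4 points and the closed="both" window holds 5, which is why the last rolling_avg in the question is the mean of four values rather than five.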
In your case, the first such interval is ]2018-02-02 21:26:00, 2018-02-02 21:27:00] (open on the left, closed on the right), which contains only the index 2018-02-02 21:27:00. Therefore the mean is computed over only one value.
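If NaNs are wanted until a full window has elapsed since the first sample, as the question expected, one possible approach is to compute the time-based rolling mean and then blank out the early rows. This is only a sketch: the helper name rolling_mean_full_window is made up for illustration, and measuring "elapsed" from the first timestamp is an assumption about the desired behavior.

```python
import pandas as pd

def rolling_mean_full_window(series, window):
    """Time-based rolling mean, NaN until `window` has elapsed since the first sample.

    Hypothetical helper; `series` must have a sorted DatetimeIndex and
    `window` must be an offset string such as "60s" or "5min".
    """
    means = series.rolling(window).mean()
    elapsed = series.index - series.index[0]
    # Keep a value only where at least one full window of time is behind it.
    return means.where(elapsed >= pd.Timedelta(window))

# Sample data from the question (21:27:48 is missing).
s = pd.Series(
    [75.4356, 78.29821, 73.098345, 78.3331, 79.111],
    index=pd.to_datetime([
        "2018-02-02 21:27:00",
        "2018-02-02 21:27:12",
        "2018-02-02 21:27:24",
        "2018-02-02 21:27:36",
        "2018-02-02 21:28:00",
    ]),
)

result = rolling_mean_full_window(s, "60s")
print(result)
```

The missing point at 21:27:48 is simply absent from the window, so the mean at 21:28:00 divides by the four points actually present. The same helper can be called with other offsets such as "5min" or "15min".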
I hope this makes sense to you.