Python Pandas - Moving Average with uneven period lengths
I'm trying to figure out how to deal with time series data in pandas that has uneven period lengths. The first example I'm looking at is how to calculate a moving average for the last 15 days.
Here is an example of the data (time is UTC):
index date_time data
46701 1/06/2016 19:27 15.00
46702 1/06/2016 19:28 18.25
46703 1/06/2016 19:30 16.50
46704 1/06/2016 19:33 17.20
46705 1/06/2016 19:34 18.18
I'm not sure if I should just fill in data so it's all in even 1-minute increments, or if there is a smarter way... If anyone has suggestions, it would be much appreciated.
Thanks - KC
You can do something like this: resample the data to 1-minute intervals with a fill strategy, then compute the moving average on the evenly spaced result. The code below uses bfill (back fill, which uses the next valid value), but another strategy could be more appropriate, such as ffill (forward fill, which propagates the last valid value).
Note: This syntax for rolling was introduced in pandas 0.18.0. However, it is possible to do the same thing in previous versions with pd.rolling_mean.
import pandas as pd

# Test data
d = {'data': [15.0, 18.25, 16.5, 17.2, 18.18],
     'date_time': ['1/06/2016 19:27',
                   '1/06/2016 19:28',
                   '1/06/2016 19:30',
                   '1/06/2016 19:33',
                   '1/06/2016 19:34'],
     'index': [46701, 46702, 46703, 46704, 46705]}
df = pd.DataFrame(d)
# Strings are parsed month-first by default, so these become 2016-01-06
df['date_time'] = pd.to_datetime(df['date_time'])
# Setting the date as the index
df.set_index('date_time', inplace=True)
# Resampling to 1-minute steps, back-filling the gaps
# ('1min' is the current alias; older pandas versions also accept '1T')
df = df.resample('1min').bfill()
# Performing a centered moving average over 3 rows (3 minutes)
df['moving'] = df['data'].rolling(window=3, center=True).mean()
df.plot(y=['data', 'moving'])
df
                      data  index     moving
date_time
2016-01-06 19:27:00  15.00  46701        NaN
2016-01-06 19:28:00  18.25  46702  16.583333
2016-01-06 19:29:00  16.50  46703  17.083333
2016-01-06 19:30:00  16.50  46703  16.733333
2016-01-06 19:31:00  17.20  46704  16.966667
2016-01-06 19:32:00  17.20  46704  17.200000
2016-01-06 19:33:00  17.20  46704  17.526667
2016-01-06 19:34:00  18.18  46705        NaN
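The fill strategy decides which values the moving average actually sees in the gaps. As a small sketch (the names s, back, fwd and comparison are illustrative and not part of the answer above), the same five points can be resampled both ways and compared side by side:

```python
import pandas as pd

# The five observations from the question, with unambiguous ISO timestamps
s = pd.Series(
    [15.0, 18.25, 16.5, 17.2, 18.18],
    index=pd.to_datetime([
        '2016-01-06 19:27',
        '2016-01-06 19:28',
        '2016-01-06 19:30',
        '2016-01-06 19:33',
        '2016-01-06 19:34',
    ]),
)

# Back fill: each missing minute takes the NEXT observed value
back = s.resample('1min').bfill()
# Forward fill: each missing minute takes the LAST observed value
fwd = s.resample('1min').ffill()

# Side-by-side view of the two strategies on the 1-minute grid
comparison = pd.DataFrame({'bfill': back, 'ffill': fwd})
print(comparison)
```

The two columns differ only at the minutes that had no observation (19:29, 19:31 and 19:32), which is where the choice matters for the average.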
Here is an example with missing data.
import numpy as np
import pandas as pd

# Random data parameters
num_sample = (0, 100)  # range of the random values
nb_sample = 1000
start_date = '2016-06-02'
freq = '2min'
random_state = np.random.RandomState(0)
# Generating random data at randomly chosen timestamps
df = pd.DataFrame({'data': random_state.randint(num_sample[0], num_sample[1], nb_sample)},
                  index=random_state.choice(
                      pd.date_range(start=pd.to_datetime(start_date), periods=nb_sample * 3,
                                    freq=freq),
                      nb_sample))
# Removing duplicate index entries (this also sorts the index)
df = df.groupby(df.index).first()
# Removing data for closed periods
df.loc[(df.index.hour >= 22) | (df.index.hour <= 7), 'data'] = np.nan
# Resampling to 1-minute steps, forward-filling the gaps
df = df.resample('1min').ffill()
# Moving average over 60 one-minute rows, i.e. one hour
df['avg'] = df['data'].rolling(window=60).mean()
ax = df.plot(kind='line', subplots=True)
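As an alternative to resampling, more recent pandas versions (0.19.0 and later, to my knowledge) let rolling take a time offset as the window when the index is a sorted DatetimeIndex. The sketch below is not part of the approach above; the names s and time_avg are illustrative, and a 3-minute window is used only so the effect is visible on five points. For the 15-day average asked about in the question, the same call with '15D' should apply.

```python
import pandas as pd

# The five unevenly spaced observations from the question
s = pd.Series(
    [15.0, 18.25, 16.5, 17.2, 18.18],
    index=pd.to_datetime([
        '2016-01-06 19:27',
        '2016-01-06 19:28',
        '2016-01-06 19:30',
        '2016-01-06 19:33',
        '2016-01-06 19:34',
    ]),
)

# Time-based window: each row averages the observations whose timestamps
# fall within the 3 minutes ending at (and including) that row's timestamp.
# No resampling or filling is needed, so only real observations are averaged.
time_avg = s.rolling('3min').mean()
print(time_avg)

# For the question's use case the window would be an offset such as '15D':
# s.rolling('15D').mean()
```

With an offset window the number of points per window varies with the data density, which is usually what is wanted for irregular series; the fixed-count window after resampling instead weights filled-in values the same as real ones.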