Pandas.resample to a non-integer multiple frequency
I have to resample my dataset from a 10-minute interval to a 15-minute interval to bring it in sync with another dataset. Based on my searches on Stack Overflow I have some ideas on how to proceed, but none of them delivers a clean and clear solution.
Problem setup
#%% Import modules
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
#%% make timestamps
periods = 12
startdate = '2010-01-01'
timestamp10min = pd.date_range(startdate, freq='10Min', periods=periods)
#%% Make DataFrame and fill it with some data
df = pd.DataFrame(index=timestamp10min)
y = -(np.arange(periods)-periods/2)**2
df['y'] = y
Now I want the values that already fall on a 10-minute timestamp (**:00 and **:30) to stay unchanged, and the values at **:15 and **:45 to be the mean of **:10, **:20 and of **:40, **:50 respectively. The core of the problem is that 15 minutes is not an integer multiple of 10 minutes. Otherwise simply applying df.resample('15Min', how='mean') (df.resample('15Min').mean() in current pandas) would have worked.
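The target described above can be written down directly for the example data. This is only a minimal sketch: it assumes the series starts on a half-hour boundary and that the number of samples is a multiple of three, and the names groups, expected and target are made up for the illustration.

```python
import numpy as np
import pandas as pd

periods = 12
index_10min = pd.date_range('2010-01-01', freq='10min', periods=periods)
y = -(np.arange(periods) - periods / 2) ** 2

# One row per half hour: columns hold the **:00/**:30, **:10/**:40 and **:20/**:50 values
# (assumption: the data starts on a half-hour boundary and periods is a multiple of 3).
groups = y.reshape(-1, 3)

# Keep the first value of each row unchanged and average the other two.
expected = np.column_stack([groups[:, 0], groups[:, 1:].mean(axis=1)]).ravel()

index_15min = pd.date_range(index_10min[0], freq='15min', periods=len(expected))
target = pd.Series(expected, index=index_15min, name='y')
print(target)
```

Any candidate method can then be compared against target on the same 15-minute index.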
1. Simply use the 15-minute resampling and live with the small error it introduces.
2. Use two forms of resample, one with closed='left', label='left' and one with closed='right', label='right', and afterwards average both resampled forms. This still gives some error in the results, but a smaller one than the first method.
3. Resample everything to 5-minute data and then apply a rolling average. Something like that is applied here: Pandas: rolling mean by time interval
4. Resample and average with a varying number of inputs: Use numpy.average with weights for resampling a pandas array. For this I would have to create a new Series of weights with varying length, where the weights alternate between 1 and 2.
5. Resample everything to 5-minute data and then apply linear interpolation. This method is close to method 3: Pandas data frame: resample with linear interpolation
Edit: @Paul H gave a workable solution along these lines, which is still readable. Thanks!
None of these methods is really satisfying for me. Some lead to a small error, and the others would be quite difficult to read for an outsider.
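For reference, the idea of resampling to 5 minutes and applying a rolling average (method 3) could look roughly like the sketch below. It is an illustration under the assumptions of the example data (regular 10-minute samples starting at **:00), not a general solution, and the variable names are made up.

```python
import numpy as np
import pandas as pd

periods = 12
index_10min = pd.date_range('2010-01-01', freq='10min', periods=periods)
y = pd.Series(-(np.arange(periods) - periods / 2) ** 2, index=index_10min)

# Upsample to a 5-minute grid; the new in-between slots are NaN.
y_5min = y.asfreq('5min')

# Centered window of three 5-minute slots. NaN slots are skipped, so the window
# around **:15 averages **:10 and **:20, while the window around **:30 only
# contains the **:30 value itself.
smoothed = y_5min.rolling(window=3, center=True, min_periods=1).mean()

# Keep every third 5-minute slot, i.e. the 15-minute grid.
y_15min = smoothed.iloc[::3]
print(y_15min)
```

For this regular example the result matches the desired output, but the window size and the slicing step are tied to the 10-minute and 15-minute spacing.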
Below is the implementation of methods 1, 2 and 5 together with the desired output, combined with a visualization.
#%% start plot
plt.figure()
plt.plot(df.index, df['y'], label='original')
#%% resample the data to 15 minutes and plot the result
close = 'left'; label='left'
dfresamplell = pd.DataFrame()
dfresamplell['15min'] = df.y.resample('15Min', closed=close, label=label).mean()
labelstring = 'close ' + close + ' label ' + label
plt.plot(dfresamplell.index, dfresamplell['15min'], label=labelstring)
close = 'right'; label='right'
dfresamplerr = pd.DataFrame()
dfresamplerr['15min'] = df.y.resample('15Min', closed=close, label=label).mean()
labelstring = 'close ' + close + ' label ' + label
plt.plot(dfresamplerr.index, dfresamplerr['15min'], label=labelstring)
#%% make an average
dfresampleaverage = pd.DataFrame(index=dfresamplell.index)
dfresampleaverage['15min'] = (dfresamplell['15min'].values+dfresamplerr['15min'].values[:-1])/2
plt.plot(dfresampleaverage.index, dfresampleaverage['15min'], label='average of both resampling methods')
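#%% desired output
ydesired = np.zeros(periods//3*2)  # integer division: np.zeros needs an int
i = 0  # index into the 10-minute values
j = 0  # index into the 15-minute output
k = 0  # 0: copy y[i], 1: average y[i] and y[i+1]
for val in ydesired:
    if i+k==len(y): k=0
    ydesired[j] = np.mean([y[i],y[i+k]])
    j+=1
    i+=1
    if k==0: k=1
    else: k=0; i+=1
plt.plot(dfresamplell.index, ydesired, label='ydesired')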
#%% suggestion of Paul H
dfreindex = df.reindex(pd.date_range(startdate, freq='5min', periods=periods*2))
dfreindex.interpolate(inplace=True)
dfreindex = dfreindex.resample('15min').first().head()
plt.plot(dfreindex.index, dfreindex['y'], label='method Paul H')
#%% finalize plot
plt.legend()
As a bonus I have added the code I will use for the interpolation of angles. This is done by using complex numbers. Because complex interpolation is not implemented (yet), I split the complex numbers into a real and an imaginary part. After averaging, these numbers can be converted to angles again. For certain angles this is a better resampling method than simply averaging the two angles, for example: 345 and 5 degrees.
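To make the 345 and 5 degrees example concrete, here is a small sketch comparing the plain average with the average taken via unit complex numbers. The helper name circular_mean_deg is hypothetical and only used for this illustration.

```python
import numpy as np

def circular_mean_deg(angles_deg):
    """Mean direction of angles given in degrees, returned in [0, 360)."""
    # Map every angle to a point on the unit circle, average the points,
    # and convert the direction of the averaged point back to degrees.
    z = np.exp(1j * np.deg2rad(np.asarray(angles_deg, dtype=float)))
    return np.angle(z.mean(), deg=True) % 360

naive = np.mean([345, 5])               # arithmetic mean, ignores the wrap at 360
circular = circular_mean_deg([345, 5])  # mean direction on the circle
print(naive, circular)
```

The plain mean gives 175 degrees, which points in almost the opposite direction, while the complex average gives roughly 355 degrees.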
#%% make timestamps
periods = 24*6
startdate = '2010-01-01'
timestamp10min = pd.date_range(startdate, freq='10Min', periods=periods)
#%% Make DataFrame and fill it with some data
degrees = np.cumsum(np.random.randn(periods)*25) % 360
df = pd.DataFrame(index=timestamp10min)
df['deg'] = degrees
df['zreal'] = np.cos(df['deg']*np.pi/180)
df['zimag'] = np.sin(df['deg']*np.pi/180)
#%% suggestion of Paul H
dfreindex = df.reindex(pd.date_range(startdate, freq='5min', periods=periods*2))
dfreindex = dfreindex.interpolate()
dfresample = dfreindex.resample('15min').first()
#%% convert complex to degrees
def f(x):
    return np.angle(x['zreal'] + x['zimag']*1j, deg=True)
dfresample['degrees'] = dfresample[['zreal', 'zimag']].apply(f, axis=1)
#%% set all the values between 0-360 degrees (only touch the 'degrees' column)
dfresample.loc[dfresample['degrees']<0, 'degrees'] += 360
#%% wrong resampling
dfresample['deg'] = dfresample['deg'] % 360
#%% plot different sampling methods
plt.figure()
plt.plot(df.index, df['deg'], label='normal', marker='v')
plt.plot(dfresample.index, dfresample['degrees'], label='resampled according @Paul H', marker='^')
plt.plot(dfresample.index, dfresample['deg'], label='wrong resampling', marker='<')
plt.legend()
I might be misunderstanding the problem, but does this work?
import numpy as np
import pandas
data = np.arange(0, 101, 8)
index_10T = pandas.date_range(freq='10min', start='2012-01-01 00:00', periods=data.shape[0])
index_05T = pandas.date_range(freq='5min', start=index_10T[0], end=index_10T[-1])
index_15T = pandas.date_range(freq='15min', start=index_10T[0], end=index_10T[-1])
df1 = pandas.DataFrame(data=data, index=index_10T, columns=['A'])
print(df1.reindex(index=index_05T).interpolate().loc[index_15T])
import numpy as np
import pandas
data = np.arange(0, 101, 8)
index_10T = pandas.date_range(freq='10min', start='2012-01-01 00:00', periods=data.shape[0])
df1 = pandas.DataFrame(data=data, index=index_10T, columns=['A'])
print(df1)
A
2012-01-01 00:00:00 0
2012-01-01 00:10:00 8
2012-01-01 00:20:00 16
2012-01-01 00:30:00 24
2012-01-01 00:40:00 32
2012-01-01 00:50:00 40
2012-01-01 01:00:00 48
2012-01-01 01:10:00 56
2012-01-01 01:20:00 64
2012-01-01 01:30:00 72
2012-01-01 01:40:00 80
2012-01-01 01:50:00 88
2012-01-01 02:00:00 96
index_05T = pandas.date_range(freq='5min', start=index_10T[0], end=index_10T[-1])
df2 = df1.reindex(index=index_05T)
print(df2)
A
2012-01-01 00:00:00 0
2012-01-01 00:05:00 NaN
2012-01-01 00:10:00 8
2012-01-01 00:15:00 NaN
2012-01-01 00:20:00 16
2012-01-01 00:25:00 NaN
2012-01-01 00:30:00 24
2012-01-01 00:35:00 NaN
2012-01-01 00:40:00 32
2012-01-01 00:45:00 NaN
2012-01-01 00:50:00 40
2012-01-01 00:55:00 NaN
2012-01-01 01:00:00 48
2012-01-01 01:05:00 NaN
2012-01-01 01:10:00 56
2012-01-01 01:15:00 NaN
2012-01-01 01:20:00 64
2012-01-01 01:25:00 NaN
2012-01-01 01:30:00 72
2012-01-01 01:35:00 NaN
2012-01-01 01:40:00 80
2012-01-01 01:45:00 NaN
2012-01-01 01:50:00 88
2012-01-01 01:55:00 NaN
2012-01-01 02:00:00 96
print(df2.interpolate())
A
2012-01-01 00:00:00 0
2012-01-01 00:05:00 4
2012-01-01 00:10:00 8
2012-01-01 00:15:00 12
2012-01-01 00:20:00 16
2012-01-01 00:25:00 20
2012-01-01 00:30:00 24
2012-01-01 00:35:00 28
2012-01-01 00:40:00 32
2012-01-01 00:45:00 36
2012-01-01 00:50:00 40
2012-01-01 00:55:00 44
2012-01-01 01:00:00 48
2012-01-01 01:05:00 52
2012-01-01 01:10:00 56
2012-01-01 01:15:00 60
2012-01-01 01:20:00 64
2012-01-01 01:25:00 68
2012-01-01 01:30:00 72
2012-01-01 01:35:00 76
2012-01-01 01:40:00 80
2012-01-01 01:45:00 84
2012-01-01 01:50:00 88
2012-01-01 01:55:00 92
2012-01-01 02:00:00 96
index_15T = pandas.date_range(freq='15min', start=index_10T[0], end=index_10T[-1])
print(df2.interpolate().loc[index_15T])
A
2012-01-01 00:00:00 0
2012-01-01 00:15:00 12
2012-01-01 00:30:00 24
2012-01-01 00:45:00 36
2012-01-01 01:00:00 48
2012-01-01 01:15:00 60
2012-01-01 01:30:00 72
2012-01-01 01:45:00 84
2012-01-01 02:00:00 96
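The reindex, interpolate and select steps shown above could be wrapped in a small helper. This is only a sketch: the function name resample_by_interpolation is made up, and it differs from the answer in two assumed details, namely that it reindexes onto the union of the old and new timestamps instead of a fixed 5-minute grid, and that it uses method='time' so the interpolation is weighted by the actual time distance.

```python
import numpy as np
import pandas as pd

def resample_by_interpolation(frame, freq):
    """Linearly interpolate `frame` onto a regular grid with spacing `freq`."""
    target = pd.date_range(frame.index[0], frame.index[-1], freq=freq)
    # Union of old and new timestamps, so no original sample is dropped
    # before interpolating.
    combined = frame.index.union(target)
    # method='time' weights by the time distance between neighbouring samples.
    return frame.reindex(combined).interpolate(method='time').loc[target]

index_10min = pd.date_range('2012-01-01 00:00', freq='10min', periods=13)
df1 = pd.DataFrame({'A': np.arange(0, 101, 8)}, index=index_10min)
result = resample_by_interpolation(df1, '15min')
print(result)
```

For the 10-minute to 15-minute case this should give the same values as the step-by-step version, and it does not depend on finding a common finer grid by hand.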
Ok, here's one way to do it.
Note this only works since you want the values exactly halfway between the values you already have, time-wise. Note the last time comes out np.nan because you don't have any later data.
import datetime as dt

# Build the 15-minute timestamps that fall inside the data range
times_15 = []
current = df.index[0]
while current < df.index[-2]:
    current = current + dt.timedelta(minutes=15)
    times_15.append(current)
# Add the new timestamps to the existing index, in chronological order
combined = sorted(set(times_15) | set(df.index))
df = df.reindex(combined)
# Average the previous and the next known value
df['ff'] = df['y'].ffill()
df['bf'] = df['y'].bfill()
df['solution'] = df[['ff', 'bf']].mean(axis=1)
df.loc[times_15, :]
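The halfway caveat can be checked on a tiny example. The sketch below uses made-up data (two samples, 10 minutes apart) and compares the mean of forward and backward fill with time-weighted linear interpolation at a midpoint and at an off-centre timestamp.

```python
import pandas as pd

index = pd.to_datetime(['2010-01-01 00:00', '2010-01-01 00:10'])
s = pd.Series([0.0, 10.0], index=index)

# One target off-centre (00:02) and one exactly at the midpoint (00:05).
targets = pd.to_datetime(['2010-01-01 00:02', '2010-01-01 00:05'])
expanded = s.reindex(s.index.union(targets))

# Mean of the previous and the next known value.
fill_mean = (expanded.ffill() + expanded.bfill()) / 2
# Linear interpolation weighted by time distance.
linear = expanded.interpolate(method='time')

comparison = pd.DataFrame({'fill_mean': fill_mean, 'linear': linear}).loc[targets]
print(comparison)
```

Both columns agree at 00:05, but at 00:02 the fill-based mean still returns the midpoint value while the interpolation follows the line between the two samples.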
In case someone is working with data that has no regularity at all, here is a solution adapted from the one provided by Paul H above.
If you don't want to interpolate throughout the time series, but only in those places where resampling is meaningful, you may keep the interpolated column side by side with the original and finish with a resample and dropna.
import numpy as np
import pandas
data = np.arange(0, 101, 3)
index_setup = pandas.date_range(freq='1min', start='2022-01-01 00:00', periods=data.shape[0])
df1 = pandas.DataFrame(data=data, index=index_setup, columns=['A'])
df1 = df1.sample(frac=0.2).sort_index()
print(df1)
A
2022-01-01 00:03:00 9
2022-01-01 00:06:00 18
2022-01-01 00:08:00 24
2022-01-01 00:18:00 54
2022-01-01 00:25:00 75
2022-01-01 00:27:00 81
2022-01-01 00:30:00 90
Notice that resampling this DF, which has no regularity, forces values onto the floor index without interpolating.
print(df1.resample('5min').mean())
A
2022-01-01 00:00:00 9.0
2022-01-01 00:05:00 24.0
2022-01-01 00:10:00 39.0
2022-01-01 00:15:00 51.0
2022-01-01 00:20:00 NaN
2022-01-01 00:25:00 79.5
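What "forcing values onto the floor index" means can be seen on a fixed three-row example (made-up values following the same 3-per-minute pattern as above, so it does not depend on the random sample):

```python
import pandas as pd

index = pd.to_datetime(['2022-01-01 00:03', '2022-01-01 00:06', '2022-01-01 00:08'])
s = pd.Series([9, 18, 24], index=index)

# Each observation is assigned to the 5-minute bin that starts at or before it,
# and the bin label carries the mean of the observations inside the bin.
binned = s.resample('5min').mean()
print(binned)
```

The 00:05 label ends up with the mean of the 00:06 and 00:08 observations (21), not with the value the series would have at 00:05 itself (15 for this linear data).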
A better solution can be achieved by interpolating on a small enough interval and then resampling. The resulting DF now has too many rows, but a dropna() brings it close to its original shape.
index_1min = pandas.date_range(freq='1min', start='2022-01-01 00:00', end='2022-01-01 23:59')
df2 = df1.reindex(index=index_1min)
df2['A_interp'] = df2['A'].interpolate(limit_direction='both')
print(df2.resample('5min').first().dropna())
A A_interp
2022-01-01 00:00:00 9.0 9.0
2022-01-01 00:05:00 21.0 15.0
2022-01-01 00:10:00 39.0 30.0
2022-01-01 00:15:00 51.0 45.0
2022-01-01 00:25:00 75.0 75.0
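A possible variation, sketched here under assumptions rather than taken from the answer: if flat extrapolation before the first and after the last observation is not wanted, interpolate accepts limit_area='inside', and asfreq picks the value exactly at each 5-minute stamp (resample(...).first() returns the first non-null value inside each bin instead). The data below is a fixed stand-in for the random sample.

```python
import pandas as pd

# Fixed irregular sample (assumed for the illustration), value = 3 * minute.
minutes = [3, 6, 8, 18, 25, 27, 30]
index = pd.Timestamp('2022-01-01') + pd.to_timedelta(minutes, unit='min')
df1 = pd.DataFrame({'A': [3 * m for m in minutes]}, index=index)

# Interpolate on a 1-minute grid, but only between existing observations.
index_1min = pd.date_range('2022-01-01 00:00', '2022-01-01 00:30', freq='1min')
interp = df1['A'].reindex(index_1min).interpolate(limit_area='inside')

# Take the values exactly on the 5-minute grid and drop the stamps that
# lie outside the observed range.
on_grid = interp.asfreq('5min').dropna()
print(on_grid)
```

Here 00:00 is dropped because it lies before the first observation, and the remaining stamps carry the interpolated values.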