Pandas.resample to a non-integer multiple frequency

I have to resample my dataset from a 10-minute interval to a 15-minute interval to bring it in sync with another dataset. Based on my searches on Stack Overflow I have some ideas how to proceed, but none of them delivers a clean and clear solution.

Problem

Problem setup

#%% Import modules 
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

#%% make timestamps
periods = 12
startdate = '2010-01-01'
timestamp10min = pd.date_range(startdate, freq='10Min', periods=periods)


#%% Make DataFrame and fill it with some data
df = pd.DataFrame(index=timestamp10min)
y = -(np.arange(periods)-periods/2)**2
df['y'] = y 

Desired output

Now I want the values that already fall on a 15-minute mark (**:00 and **:30) to be unchanged, and the values at **:15 and **:45 to be the mean of **:10, **:20 and of **:40, **:50, respectively. The core of the problem is that 15 minutes is not an integer multiple of 10 minutes. Otherwise simply applying df.resample('15Min').mean() would have worked.
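To make the mismatch concrete, here is a minimal sketch that compares plain 15-minute binning with the desired values for the sample data above. The names `binned`, `desired` and `comparison` are only illustrative, and the hand-built `desired` series assumes the 10-minute data starts on a full hour, so that every third sample falls on a 15-minute mark.

```python
import numpy as np
import pandas as pd

periods = 12
index_10min = pd.date_range('2010-01-01', freq='10min', periods=periods)
y = pd.Series(-(np.arange(periods) - periods / 2) ** 2, index=index_10min)

# Plain 15-minute binning: each bin [t, t + 15 min) averages the samples inside it,
# so bins alternately contain two samples and one sample.
binned = y.resample('15min').mean()

# Desired values, built by hand (assumes the series starts at **:00):
# samples 0, 3, 6, ... sit exactly on **:00 / **:30 and are kept,
# samples (1, 2), (4, 5), ... straddle **:15 / **:45 and are averaged.
vals = y.to_numpy()
desired = np.empty(periods // 3 * 2)
desired[0::2] = vals[0::3]
desired[1::2] = (vals[1::3] + vals[2::3]) / 2
desired = pd.Series(desired, index=binned.index)

comparison = pd.DataFrame({'binned': binned, 'desired': desired})
print(comparison)
```

The two columns differ in every row, which is the error that the methods below try to avoid or reduce.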

Possible solutions

  1. Simply use the 15-minute resampling and just live with the small error it introduces.

  2. Use two forms of resample, with closed='left', label='left' and closed='right', label='right'. Afterwards I could average both resampled forms. The result will still contain some error, but a smaller one than with the first method.

  3. Resample everything to 5-minute data and then apply a rolling average. Something like that is applied here: Pandas: rolling mean by time interval

  4. Resample and average with a varying number of inputs: Use numpy.average with weights for resampling a pandas array. For this I would have to create a new Series with varying weight length, where the weights alternate between 1 and 2.

  5. Resample everything to 5-minute data and then apply linear interpolation. This method is close to method 3. See: Pandas data frame: resample with linear interpolation. Edit: @Paul H gave a workable solution along these lines, which is still readable. Thanks!

None of these methods is really satisfying for me. Some lead to a small error, and others would be quite difficult to read for an outsider.
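For reference, method 3 could be sketched roughly as follows. This is only one possible reading of "resample to 5 minutes, then rolling average": it assumes a centered window of three 5-minute steps with `min_periods=1`, so that the NaN rows created by upsampling are skipped. The names `fine`, `smoothed` and `result` are illustrative.

```python
import numpy as np
import pandas as pd

periods = 12
index_10min = pd.date_range('2010-01-01', freq='10min', periods=periods)
y = pd.Series(-(np.arange(periods) - periods / 2) ** 2, index=index_10min)

# Upsample to a 5-minute grid; the new **:05, **:15, ... rows are NaN.
fine = y.resample('5min').asfreq()

# Centered 3-step window: at **:15 it sees (**:10, NaN, **:20) and averages the
# two real samples; at **:00 or **:30 it sees (NaN, value, NaN) and keeps the value.
smoothed = fine.rolling(window=3, center=True, min_periods=1).mean()

# Keep only the 15-minute timestamps.
index_15min = pd.date_range(index_10min[0], index_10min[-1], freq='15min')
result = smoothed.reindex(index_15min)
print(result)
```

For this particular 10-to-15-minute case the window reproduces the desired values; for other frequency pairs the window length would have to be chosen again.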

Implementation

The implementation of methods 1, 2 and 5, together with the desired output and a visualization.

#%% start plot
plt.figure()
plt.plot(df.index, df['y'], label='original')

#%% resample the data to 15 minutes and plot the result
close = 'left'; label='left'
dfresamplell = pd.DataFrame()
dfresamplell['15min'] = df.y.resample('15Min', closed=close, label=label).mean()
labelstring = 'close ' + close + ' label ' + label        
plt.plot(dfresamplell.index, dfresamplell['15min'], label=labelstring)
        
close = 'right'; label='right'
dfresamplerr = pd.DataFrame()
dfresamplerr['15min'] = df.y.resample('15Min', closed=close, label=label).mean()
labelstring = 'close ' + close + ' label ' + label        
plt.plot(dfresamplerr.index, dfresamplerr['15min'], label=labelstring)

#%% make an average
dfresampleaverage = pd.DataFrame(index=dfresamplell.index)
dfresampleaverage['15min'] = (dfresamplell['15min'].values+dfresamplerr['15min'].values[:-1])/2
plt.plot(dfresampleaverage.index, dfresampleaverage['15min'], label='average of both resampling methods')

#%% desired output
ydesired = np.zeros(periods//3*2)  # integer division: np.zeros needs an int size
i = 0 
j = 0 
k = 0 
for val in ydesired:
    if i+k==len(y): k=0
    ydesired[j] = np.mean([y[i],y[i+k]]) 
    j+=1
    i+=1
    if k==0: k=1; 
    else: k=0; i+=1
plt.plot(dfresamplell.index, ydesired, label='ydesired')


#%% suggestion of Paul H
dfreindex = df.reindex(pd.date_range(startdate, freq='5min', periods=periods*2))
dfreindex.interpolate(inplace=True)
dfreindex = dfreindex.resample('15min').first().head()
plt.plot(dfreindex.index, dfreindex['y'], label='method Paul H')


#%% finalize plot
plt.legend()

Implementation for angles

As a bonus I have added the code I will use for the interpolation of angles. This is done by using complex numbers. Because complex interpolation is not implemented (yet), I split the complex numbers into a real and an imaginary part. After averaging, these numbers can be converted to angles again. For certain angles this is a better resampling method than simply averaging the two angles, for example: 345 and 5 degrees.

#%% make timestamps
periods = 24*6
startdate = '2010-01-01'
timestamp10min = pd.date_range(startdate, freq='10Min', periods=periods)

#%% Make DataFrame and fill it with some data
degrees = np.cumsum(np.random.randn(periods)*25) % 360
df = pd.DataFrame(index=timestamp10min)
df['deg'] = degrees
df['zreal'] = np.cos(df['deg']*np.pi/180)
df['zimag'] = np.sin(df['deg']*np.pi/180)

#%% suggestion of Paul H
dfreindex = df.reindex(pd.date_range(startdate, freq='5min', periods=periods*2))
dfreindex = dfreindex.interpolate()
dfresample = dfreindex.resample('15min').first()

#%% convert complex to degrees
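def f(x):
    # rebuild the complex number from its parts and take its argument in degrees
    return np.angle(x['zreal'] + x['zimag']*1j, deg=True)
dfresample['degrees'] = dfresample[['zreal', 'zimag']].apply(f, axis=1)

#%% set all the values between 0-360 degrees
# shift only the 'degrees' column; the other columns must stay untouched
dfresample.loc[dfresample['degrees'] < 0, 'degrees'] += 360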

#%% wrong resampling
dfresample['deg'] = dfresample['deg'] % 360

#%% plot different sampling methods
plt.figure()
plt.plot(df.index, df['deg'], label='normal', marker='v')
plt.plot(dfresample.index, dfresample['degrees'], label='resampled according @Paul H', marker='^')
plt.plot(dfresample.index, dfresample['deg'], label='wrong resampling', marker='<')
plt.legend()
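The 345/5 degree example mentioned above can be checked in isolation. The helper name `circular_mean_deg` is hypothetical and not part of NumPy or pandas; it is a small sketch of the same unit-vector averaging that the real and imaginary columns implement.

```python
import numpy as np

def circular_mean_deg(angles_deg):
    """Average angles (in degrees) by averaging their unit vectors."""
    z = np.exp(1j * np.deg2rad(angles_deg)).mean()  # mean of points on the unit circle
    return np.angle(z, deg=True) % 360               # map the result back to [0, 360)

naive = np.mean([345, 5])              # arithmetic mean, points in the opposite direction
circular = circular_mean_deg([345, 5])
print(naive, circular)
```

The arithmetic mean gives 175 degrees, while the unit-vector mean gives approximately 355 degrees, the direction that actually lies between the two inputs.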

I might be misunderstanding the problem, but does this work?

TL;DR version:

import numpy as np
import pandas

data = np.arange(0, 101, 8)
# pandas.date_range replaces the DatetimeIndex(start=..., freq=...) constructor,
# which no longer exists in current pandas
index_10T = pandas.date_range(freq='10min', start='2012-01-01 00:00', periods=data.shape[0])
index_05T = pandas.date_range(freq='5min', start=index_10T[0], end=index_10T[-1])
index_15T = pandas.date_range(freq='15min', start=index_10T[0], end=index_10T[-1])
df1 = pandas.DataFrame(data=data, index=index_10T, columns=['A'])
print(df1.reindex(index=index_05T).interpolate().loc[index_15T])

Long version

Set up fake data

import numpy as np
import pandas

data = np.arange(0, 101, 8)
index_10T = pandas.date_range(freq='10min', start='2012-01-01 00:00', periods=data.shape[0])
df1 = pandas.DataFrame(data=data, index=index_10T, columns=['A'])
print(df1)


                      A
2012-01-01 00:00:00   0
2012-01-01 00:10:00   8
2012-01-01 00:20:00  16
2012-01-01 00:30:00  24
2012-01-01 00:40:00  32
2012-01-01 00:50:00  40
2012-01-01 01:00:00  48
2012-01-01 01:10:00  56
2012-01-01 01:20:00  64
2012-01-01 01:30:00  72
2012-01-01 01:40:00  80
2012-01-01 01:50:00  88
2012-01-01 02:00:00  96

So then build a new 5-minute index and reindex the original dataframe

index_05T = pandas.date_range(freq='5min', start=index_10T[0], end=index_10T[-1])
df2 = df1.reindex(index=index_05T)
print(df2)

                      A
2012-01-01 00:00:00   0
2012-01-01 00:05:00 NaN
2012-01-01 00:10:00   8
2012-01-01 00:15:00 NaN
2012-01-01 00:20:00  16
2012-01-01 00:25:00 NaN
2012-01-01 00:30:00  24
2012-01-01 00:35:00 NaN
2012-01-01 00:40:00  32
2012-01-01 00:45:00 NaN
2012-01-01 00:50:00  40
2012-01-01 00:55:00 NaN
2012-01-01 01:00:00  48
2012-01-01 01:05:00 NaN
2012-01-01 01:10:00  56
2012-01-01 01:15:00 NaN
2012-01-01 01:20:00  64
2012-01-01 01:25:00 NaN
2012-01-01 01:30:00  72
2012-01-01 01:35:00 NaN
2012-01-01 01:40:00  80
2012-01-01 01:45:00 NaN
2012-01-01 01:50:00  88
2012-01-01 01:55:00 NaN
2012-01-01 02:00:00  96

and then linearly interpolate

print(df2.interpolate())
                      A
2012-01-01 00:00:00   0
2012-01-01 00:05:00   4
2012-01-01 00:10:00   8
2012-01-01 00:15:00  12
2012-01-01 00:20:00  16
2012-01-01 00:25:00  20
2012-01-01 00:30:00  24
2012-01-01 00:35:00  28
2012-01-01 00:40:00  32
2012-01-01 00:45:00  36
2012-01-01 00:50:00  40
2012-01-01 00:55:00  44
2012-01-01 01:00:00  48
2012-01-01 01:05:00  52
2012-01-01 01:10:00  56
2012-01-01 01:15:00  60
2012-01-01 01:20:00  64
2012-01-01 01:25:00  68
2012-01-01 01:30:00  72
2012-01-01 01:35:00  76
2012-01-01 01:40:00  80
2012-01-01 01:45:00  84
2012-01-01 01:50:00  88
2012-01-01 01:55:00  92
2012-01-01 02:00:00  96

Build a 15-minute index and use that to pull out data:

index_15T = pandas.date_range(freq='15min', start=index_10T[0], end=index_10T[-1])
print(df2.interpolate().loc[index_15T])

                      A
2012-01-01 00:00:00   0
2012-01-01 00:15:00  12
2012-01-01 00:30:00  24
2012-01-01 00:45:00  36
2012-01-01 01:00:00  48
2012-01-01 01:15:00  60
2012-01-01 01:30:00  72
2012-01-01 01:45:00  84
2012-01-01 02:00:00  96
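The three steps above (reindex to a fine grid, interpolate, select the target grid) can be wrapped into one function. This is a sketch only: the function name `resample_by_interpolation` and its parameters are hypothetical, and it assumes that `fine_freq` divides both the original and the target spacing, so that every original timestamp lies on the fine grid.

```python
import numpy as np
import pandas as pd

def resample_by_interpolation(frame, fine_freq, target_freq):
    """Linearly interpolate `frame` on a fine grid, then keep the target timestamps."""
    start, end = frame.index[0], frame.index[-1]
    fine_index = pd.date_range(start, end, freq=fine_freq)
    target_index = pd.date_range(start, end, freq=target_freq)
    # reindex inserts NaN rows on the fine grid, interpolate() fills them linearly
    return frame.reindex(fine_index).interpolate().reindex(target_index)

data = np.arange(0, 101, 8)
index_10min = pd.date_range('2012-01-01 00:00', freq='10min', periods=data.shape[0])
df1 = pd.DataFrame(data=data, index=index_10min, columns=['A'])

out = resample_by_interpolation(df1, '5min', '15min')
print(out)
```

Because the default `interpolate()` treats rows as equally spaced, the equal spacing of the fine grid is what makes the result correct in time.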

Ok, here's one way to do it.

  1. Make a list of the times you want to have filled in
  2. Make a combined index that includes the times you want and the times you already have
  3. Take your data and "forward fill it"
  4. Take your data and "backward fill it"
  5. Average the forward and backward fills
  6. Select only the rows you want

Note this only works since you want the values exactly halfway between the values you already have, time-wise. Note the last time comes out np.nan because you don't have any later data.

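import datetime as dt

# 1. list the 15-minute timestamps to fill in
times_15 = []
current = df.index[0]
while current < df.index[-2]:
    current = current + dt.timedelta(minutes=15)
    times_15.append(current)
# 2. combined, sorted index of the wanted and the existing times
combined = sorted(set(times_15) | set(df.index))
df = df.reindex(combined)
# 3.-5. forward fill, backward fill, and average the two
df['ff'] = df['y'].ffill()
df['bf'] = df['y'].bfill()
df['solution'] = df[['ff', 'bf']].mean(axis=1)
# 6. select only the wanted rows
df.loc[times_15, :]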

In case someone is working with data that has no regularity at all, here is a solution adapted from the one provided by Paul H above.

If you don't want to interpolate throughout the time series, but only in those places where resampling is meaningful, you can keep the interpolated column side by side with the original and finish with a resample and dropna.

import numpy as np
import pandas

data = np.arange(0, 101, 3)
index_setup = pandas.date_range(freq='01T', start='2022-01-01 00:00', periods=data.shape[0])
df1 = pandas.DataFrame(data=data, index=index_setup, columns=['A'])
df1 = df1.sample(frac=0.2).sort_index()
print(df1)
                      A
2022-01-01 00:03:00   9
2022-01-01 00:06:00  18
2022-01-01 00:08:00  24
2022-01-01 00:18:00  54
2022-01-01 00:25:00  75
2022-01-01 00:27:00  81
2022-01-01 00:30:00  90

Notice that resampling this irregular DF simply assigns the values to the floor of each 5-minute bin, without interpolating.

print(df1.resample('05T').mean())

                        A
2022-01-01 00:00:00   9.0
2022-01-01 00:05:00  24.0
2022-01-01 00:10:00  39.0
2022-01-01 00:15:00  51.0
2022-01-01 00:20:00   NaN
2022-01-01 00:25:00  79.5

A better solution can be achieved by interpolating on a small enough interval and then resampling. The resulting DF now has too many rows, but a dropna() brings it close to its original shape.

index_1min = pandas.date_range(freq='01T', start='2022-01-01 00:00', end='2022-01-01 23:59')
df2 = df1.reindex(index=index_1min)
df2['A_interp'] = df2['A'].interpolate(limit_direction='both')
print(df2.resample('05T').first().dropna())

                        A  A_interp
2022-01-01 00:00:00   9.0       9.0
2022-01-01 00:05:00  21.0      15.0
2022-01-01 00:10:00  39.0      30.0
2022-01-01 00:15:00  51.0      45.0
2022-01-01 00:25:00  75.0      75.0
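As a variation on the same idea, irregular timestamps can also be interpolated directly on the union of the original and the target index, weighting by elapsed time instead of building a 1-minute grid. This is a sketch under assumptions rather than the method used above: it uses the seven sample rows printed earlier as fixed input (the random `sample` call is left out), and `limit_area='inside'` is chosen so that nothing is extrapolated before the first observation.

```python
import numpy as np
import pandas as pd

# Fixed copy of the irregular sample printed above (A equals 3 * minute).
times = pd.to_datetime([
    '2022-01-01 00:03', '2022-01-01 00:06', '2022-01-01 00:08',
    '2022-01-01 00:18', '2022-01-01 00:25', '2022-01-01 00:27',
    '2022-01-01 00:30',
])
s = pd.Series([9, 18, 24, 54, 75, 81, 90], index=times, dtype=float)

# Target 5-minute grid.
target = pd.date_range('2022-01-01 00:00', '2022-01-01 00:30', freq='5min')

# Put original and target timestamps on one sorted index; new rows are NaN.
combined = s.reindex(s.index.union(target))

# method='time' weights by the actual time gaps between observations;
# limit_area='inside' fills only NaNs that are surrounded by real values.
interpolated = combined.interpolate(method='time', limit_area='inside')

result = interpolated.reindex(target).dropna()
print(result)
```

The 00:00 row is dropped because no observation precedes it; whether long gaps such as 00:08 to 00:18 should be bridged at all is a separate decision that this sketch does not make.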
