Numpy and Pandas interpolation also changes the original data
I am trying to interpolate data for some missing days. The original data is:
2012-06-27 00:00:00 17
2012-06-27 01:00:00 17
2012-06-27 02:00:00 18
2012-06-27 03:00:00 18
2012-06-27 04:00:00 19
2012-06-27 05:00:00 20
2012-06-27 06:00:00 22
2012-06-27 07:00:00 23
2012-06-27 08:00:00 25
2012-06-27 09:00:00 27
2012-06-27 10:00:00 27
2012-06-27 11:00:00 29
2012-06-27 12:00:00 29
2012-06-27 13:00:00 30
2012-06-27 14:00:00 30
2012-06-27 15:00:00 29
2012-06-27 16:00:00 28
2012-06-27 17:00:00 26
2012-06-27 18:00:00 25
2012-06-27 19:00:00 24
2012-06-27 20:00:00 23
2012-06-27 21:00:00 23
2012-06-27 22:00:00 16
2012-06-27 23:00:00 15
2012-06-29 00:00:00 15
2012-06-29 01:00:00 16
2012-06-29 02:00:00 16
2012-06-29 03:00:00 16
2012-06-29 04:00:00 17
2012-06-29 05:00:00 17
2012-06-29 06:00:00 18
2012-06-29 07:00:00 19
2012-06-29 08:00:00 20
2012-06-29 09:00:00 22
2012-06-29 10:00:00 22
2012-06-29 11:00:00 22
2012-06-29 12:00:00 22
2012-06-29 13:00:00 22
2012-06-29 14:00:00 22
2012-06-29 15:00:00 22
2012-06-29 16:00:00 21
2012-06-29 17:00:00 19
2012-06-29 18:00:00 17
2012-06-29 19:00:00 16
2012-06-29 20:00:00 15
2012-06-29 21:00:00 14
2012-06-29 22:00:00 14
2012-06-29 23:00:00 13
As you can see, 2012-06-28 is missing, so I tried to interpolate it using both Numpy and Pandas. For Numpy the code is:
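import numpy as np

def inter_lin_nan(ts_temp, rule):
    # Older pandas averaged by default in resample(); newer versions need an explicit aggregation
    ts_temp = ts_temp.resample(rule).mean()
    mask = np.isnan(ts_temp)
    # interpolate the missing values linearly by position
    ts_temp[mask] = np.interp(np.flatnonzero(mask), np.flatnonzero(~mask), ts_temp[~mask])
    return ts_temp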
and with Pandas I used:
df_temp=df_temp.asfreq('1h')
df_temp['Temp2'] = df_temp['temp'].interpolate(method='linear')
The problem is that both of these methods do interpolate the missing day, but they also change the original data for 2012-06-29. Do you know why this is happening, or am I missing something?
I cannot reproduce the problem, but this works for me (assuming your data frame is indexed on datetime):
df_resampled = df.resample('1H').interpolate(method='linear')
Output:
As you can see, the lines overlap perfectly for the days where there is data: no original data is 'changed'. The interpolation seems to make sense too. In this plot the missing values in the original series were set to 0 to allow a comparison.
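If you would rather check this numerically than by eye, something like the following minimal sketch may help. It uses made-up hourly values (not the temperatures from the question) with the same one-day gap, and the names `idx`, `unchanged` and `n_filled` are only illustrative. It compares the interpolated frame against the original at the original timestamps.

```python
import numpy as np
import pandas as pd

# Hypothetical sample: hourly readings for two days, with the day in between missing
idx = pd.date_range("2012-06-27", periods=24, freq="h").append(
    pd.date_range("2012-06-29", periods=24, freq="h")
)
df = pd.DataFrame({"temp": np.arange(48, dtype=float)}, index=idx)

# Upsample to a regular hourly index and fill the new rows linearly
df_resampled = df.resample("1h").interpolate(method="linear")

# Values at the original timestamps should be exactly what they were before
unchanged = np.array_equal(
    df_resampled.loc[df.index, "temp"].to_numpy(), df["temp"].to_numpy()
)
# Number of rows that were added for the missing day
n_filled = len(df_resampled) - len(df)
print(unchanged, n_filled)
```

If `unchanged` comes out False on your real data, it may be worth checking for duplicate or unsorted timestamps in the index before resampling, since that is one way the original rows could get aggregated.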