What is the best way to remove interpolation from time series data in Pandas?
I have time series data in a pandas DataFrame in which the missing values were filled by interpolation. I now want to remove the interpolated values and replace them with np.nan again.
Input data:
Index Column_one Column_two
2017:10:03 03:44:00 13.61504936 14.65000057
2017:10:03 03:45:00 13.61504936 14.65000057
2017:10:03 03:46:00 13.61504936 14.65000057
2017:10:03 03:47:00 13.61504936 np.nan
2017:10:03 03:48:00 13.60000038 np.nan
2017:10:03 03:49:00 np.nan np.nan
2017:10:03 03:50:00 np.nan np.nan
2017:10:03 03:51:00 np.nan np.nan
2017:10:03 03:52:00 np.nan 14.80000019
2017:10:03 03:53:00 np.nan 14.80000019
2017:10:03 03:54:00 14.21253681 14.80000019
2017:10:03 03:55:00 14.24253273 14.80000019
All missing values are filled by interpolation:
data_interpolated = data.interpolate()
Interpolated data:
Index Column_one Column_two
2017:10:03 03:44:00 13.61504936 14.65000057
2017:10:03 03:45:00 13.61504936 14.65000057
2017:10:03 03:46:00 13.61504936 14.65000057
2017:10:03 03:47:00 13.61504936 14.67500051
2017:10:03 03:48:00 13.60000038 14.70000044
2017:10:03 03:49:00 13.70208979 14.72500038
2017:10:03 03:50:00 13.80417919 14.75000032
2017:10:03 03:51:00 13.9062686 14.77500025
2017:10:03 03:52:00 14.008358 14.80000019
2017:10:03 03:53:00 14.11044741 14.80000019
2017:10:03 03:54:00 14.21253681 14.80000019
2017:10:03 03:55:00 14.24253273 14.80000019
Now I want to remove the interpolated values and recover the initial dataset.
Desired output:
Index Column_one Column_two
2017:10:03 03:44:00 13.61504936 14.65000057
2017:10:03 03:45:00 13.61504936 14.65000057
2017:10:03 03:46:00 13.61504936 14.65000057
2017:10:03 03:47:00 13.61504936 np.nan
2017:10:03 03:48:00 13.60000038 np.nan
2017:10:03 03:49:00 np.nan np.nan
2017:10:03 03:50:00 np.nan np.nan
2017:10:03 03:51:00 np.nan np.nan
2017:10:03 03:52:00 np.nan 14.80000019
2017:10:03 03:53:00 np.nan 14.80000019
2017:10:03 03:54:00 14.21253681 14.80000019
2017:10:03 03:55:00 14.24253273 14.80000019
Is there a good way to do this in Pandas or NumPy?
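If the un-interpolated frame is still around at the point where interpolate() is called, the simplest route is probably to record where the NaNs were and re-apply that mask afterwards. The following is only a sketch under that assumption: the names was_missing and restored are made up for illustration, and the sample values are a shortened version of the data above.

```python
import numpy as np
import pandas as pd

# Shortened sample of the input data (illustrative values only).
data = pd.DataFrame(
    {
        "Column_one": [13.60000038, np.nan, np.nan, 14.21253681],
        "Column_two": [14.65000057, np.nan, 14.80000019, 14.80000019],
    },
    index=pd.date_range("2017-10-03 03:44", periods=4, freq="min"),
)

# Remember which cells were missing BEFORE interpolating.
was_missing = data.isna()

data_interpolated = data.interpolate()

# Put NaN back wherever the original cell was missing.
restored = data_interpolated.mask(was_missing)
```

This only works when the mask (or the original frame) was saved beforehand. If only the interpolated frame is left, the interpolated points have to be guessed from the shape of the data.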
I can offer you something like this:
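import numpy as np

# Assumes df has a default integer index and numeric columns 'one' and 'two'.
# First differences between consecutive rows (row 0 has no predecessor).
for i in range(1, len(df)):
    df.loc[i, 'lin_one'] = df.loc[i, 'one'] - df.loc[i - 1, 'one']
    df.loc[i, 'lin_two'] = df.loc[i, 'two'] - df.loc[i - 1, 'two']
# Set a value to NaN when its slope minus the next slope is non-zero
# and (signed) smaller than 0.003.
for i in range(len(df) - 1):
    if df.lin_one[i] - df.lin_one[i + 1] != 0 and df.lin_one[i] - df.lin_one[i + 1] < 0.003:
        df.loc[i, 'one'] = np.nan
    if df.lin_two[i] - df.lin_two[i + 1] != 0 and df.lin_two[i] - df.lin_two[i + 1] < 0.003:
        df.loc[i, 'two'] = np.nan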
This produces the following output:
index one lin_one two lin_two
0 2017:10:03 03:44:00 13.615049 0.000000 14.650001 0.000
1 2017:10:03 03:45:00 13.615049 0.000000 14.650001 0.000
2 2017:10:03 03:46:00 13.615049 0.000000 NaN 0.000
3 2017:10:03 03:47:00 13.615049 0.000000 NaN 0.025
4 2017:10:03 03:48:00 NaN -0.015049 NaN 0.025
5 2017:10:03 03:49:00 NaN 0.102089 NaN 0.025
6 2017:10:03 03:50:00 NaN 0.102089 NaN 0.025
7 2017:10:03 03:51:00 NaN 0.102089 NaN 0.025
8 2017:10:03 03:52:00 NaN 0.102089 14.800000 0.025
9 2017:10:03 03:53:00 NaN 0.102089 14.800000 0.000
10 2017:10:03 03:54:00 14.212537 0.102089 14.800000 0.000
11 2017:10:03 03:55:00 14.242533 0.029996 14.800000 0.000
The computed columns lin_one and lin_two can then be deleted:
del df['lin_one']
del df['lin_two']
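However, this approach also wipes out one non-interpolated value per column: in the output above, the last original value before each gap (row 4 of one, row 2 of two) is set to NaN as well.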
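One possible way to keep the non-interpolated value that the loop removes is to compare each point with both of its neighbours instead of only looking forward: a point is flagged only when it lies on a sloped straight line through the previous and the next value. This is a heuristic sketch, not an established pandas feature. The helper name blank_linear_runs and the tolerance atol=1e-6 are assumptions chosen for the rounded values shown in the question.

```python
import numpy as np
import pandas as pd

def blank_linear_runs(s, atol=1e-6):
    """Set to NaN every point lying on a sloped straight line through both neighbours."""
    prev, nxt = s.shift(1), s.shift(-1)
    # Centered second difference: ~0 when prev, s and nxt are collinear.
    second_diff = (nxt - 2 * s + prev).to_numpy()
    slope = (s - prev).to_numpy()
    on_line = np.isclose(second_diff, 0.0, atol=atol)
    # Ignore flat stretches, which are treated here as real repeated readings.
    sloped = ~np.isclose(slope, 0.0, atol=atol)
    return s.mask(on_line & sloped)

# Interpolated data from the question, with a default integer index.
df = pd.DataFrame({
    "one": [13.61504936, 13.61504936, 13.61504936, 13.61504936,
            13.60000038, 13.70208979, 13.80417919, 13.9062686,
            14.008358, 14.11044741, 14.21253681, 14.24253273],
    "two": [14.65000057, 14.65000057, 14.65000057, 14.67500051,
            14.70000044, 14.72500038, 14.75000032, 14.77500025,
            14.80000019, 14.80000019, 14.80000019, 14.80000019],
})

cleaned = df.apply(blank_linear_runs)
```

On this data the flagged rows are 5 to 9 in one and 3 to 7 in two, which matches the desired output. The heuristic has clear limits: a gap interpolated between two equal values is flat and will not be detected, three genuine measurements that happen to be collinear will be flagged by mistake, and the tolerance has to be tuned to the precision of the data.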