Interpolate missing values in a time-series and create a time-series model like Cointegration, ARIMA(X) and SARIMA(X) using R

I have a dataset with 18 variables and 30k+ observations, collected from sensors every 15 minutes. Some of the variables record temperature, and these will be used as independent variables to predict the state of a structure. During this period some data is missing because the sensors were offline, and some other observations were removed because they were outliers.

Is there a way to interpolate the missing values in a fairly accurate way? I've been trying to create a dataset with every 15-minute timestamp and then merge it with the dataset that holds the recorded values:

library(lubridate)

# Minutes part of a "YYYY-MM-DD HH:MM" timestamp (characters 15-16), as a number
df$Minute <- as.integer(substr(df$Time, 15, 16))
# Round each reading down to the start of its 15-minute slot: 00, 15, 30 or 45
df$Minute <- sprintf("%02d", (df$Minute %/% 15) * 15)
# Rebuild the timestamp from "YYYY-MM-DD HH:" plus the rounded minutes
df$Time <- ymd_hm(paste0(substr(df$Time, 1, 14), df$Minute))
# Complete 15-minute grid over the observation period
allDates <- seq(ISOdate(2018, 4, 27, 14, 15), ISOdate(2018, 9, 10, 19, 45), by = "15 min")
# Left join onto the grid: slots without a reading become rows of NA
allValues <- merge(data.frame(Time = allDates), df, all.x = TRUE)

After this I tried to use the Amelia package, but it returns an error: "There are observations in the data that are completely missing"

amelia(df, m = 5, p2s = 1, frontend = FALSE, idvars = "id", ts = "Time", cs = NULL,
       polytime = NULL, splinetime = NULL, intercs = FALSE, lags = 2:17, leads = 2:17,
       startvals = 0, tolerance = 0.0001, logs = NULL, sqrts = NULL, lgstc = NULL,
       noms = NULL, ords = NULL, incheck = TRUE, collect = FALSE, arglist = NULL,
       empri = 0.1*nrow(DadosTS), priors = NULL, autopri = 0.05, emburn = c(0,0),
       bounds = NULL, max.resample = 100, overimp = NULL, boot.type = "ordinary",
       ncpus = getOption("amelia.ncpus", 1L), cl = NULL)

Any method to deal with this problem, or to use the available data to create a time-series model, would be very much appreciated.

If you want time series interpolation / imputation, you could use the imputeTS package:

library(imputeTS)
x <- na_interpolation(df)

That's all it takes.
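To make the effect concrete, here is a minimal sketch on a toy data frame. The names `toy`, `temp` and `strain` are invented for illustration and are not from the question. It passes only numeric columns, on the assumption that a timestamp column would be handled separately.

```r
library(imputeTS)

# Toy stand-in for the sensor data: two numeric columns on a regular time grid
toy <- data.frame(
  temp   = c(20.0, NA, 21.0, NA, NA, 22.5),
  strain = c(1.0, 1.2, NA, 1.6, 1.8, 2.0)
)

# Linear interpolation, carried out column by column
filled <- na_interpolation(toy, option = "linear")

print(filled)
```

Each gap is filled on the straight line between its nearest observed neighbours in the same column, so the two NAs between 21.0 and 22.5 become 21.5 and 22.0.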

Some additional hints: the package does the imputation (missing data replacement) based on the inter-time correlations of each variable, that is, each column is filled in from its own values at other time points. So it won't account for inter-variable correlations (depending on the data, using them might improve the imputation). Amelia, on the other hand, the way you used it only looks at inter-variable correlations. If a row is completely NA, as in your case (NA, NA, NA, NA), it fails, because that row has no observed value from which the missing ones could be estimated.
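A quick way to see how many rows trigger that Amelia error is to count the rows where every measured variable is NA. This is a base R sketch on made-up data. The column names are hypothetical, and `Time` is assumed to be the only non-measurement column.

```r
# Toy data on a 15-minute grid; row 2 has no observed measurement at all
toy <- data.frame(
  Time   = as.POSIXct("2018-04-27 14:15", tz = "UTC") + 900 * 0:4,
  temp   = c(20.0, NA, 21.0, NA, 21.8),
  strain = c(1.0, NA, 1.3, 1.4, NA)
)

# Every column except the timestamp counts as a measurement
value_cols <- setdiff(names(toy), "Time")

# TRUE for rows in which all measurements are missing
all_missing <- rowSums(is.na(toy[value_cols])) == length(value_cols)

sum(all_missing)        # number of completely missing rows
toy$Time[all_missing]   # timestamps of those rows
```

Rows 4 and 5 are only partially missing, so they are not flagged. Only rows like row 2 give Amelia nothing to work with.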

In imputeTS you also have different time series imputation algorithms to choose from. The code above is for imputation by linear interpolation. You could also do spline interpolation:

library(imputeTS)
x <- na_interpolation(df, option = "spline")

Another, more advanced option, which is especially useful if you know your time series has seasonality, is imputation by Kalman smoothing on a structural state space model or on the state space representation of an ARIMA model:

library(imputeTS)
x <- na_kalman(df)
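As a small illustration of the seasonal case, the sketch below uses `tsAirgap`, a monthly example series with gaps that ships with imputeTS. It is not the sensor data from the question. For the seasonality to be modelled, the input probably needs to be a `ts` object with its `frequency` set (for 15-minute readings with a daily cycle that would be 96, which is an assumption about the data). A plain numeric vector carries no seasonal period.

```r
library(imputeTS)

# Monthly series with missing values, stored as a ts object with frequency 12
stopifnot(frequency(tsAirgap) == 12)
n_missing <- sum(is.na(tsAirgap))

# Kalman smoothing on a structural time series model (the default model)
filled <- na_kalman(tsAirgap)

n_missing             # gaps before imputation
sum(is.na(filled))    # gaps after imputation
```

Passing `model = "auto.arima"` to `na_kalman` switches to the state space representation of an ARIMA model instead. That option relies on the forecast package being installed.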

There are also other imputation / interpolation functions available in the package (see this paper).

In general, if you don't expect much inter-variable correlation, the imputeTS methods listed above will very likely be a good choice. If you do expect strong inter-variable correlation, then also take a look at other approaches. For example, you could use imputeTS to impute just one column of your dataset and then run Amelia again (the error should then disappear, since you no longer have rows that are completely NA).
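A rough sketch of that two-step idea on simulated data follows. The data frame, its column names and the choice of `temp` as the column to pre-fill are all assumptions made for the example. The Amelia call is deliberately bare, without the `ts`, `lags` or `leads` settings from the question.

```r
library(imputeTS)
library(Amelia)

set.seed(1)
n <- 200
temp <- 20 + cumsum(rnorm(n, sd = 0.2))

# Simulated, correlated sensor readings with an integer row identifier
toy <- data.frame(
  id     = seq_len(n),
  temp   = temp,
  strain = 0.5 * temp + rnorm(n, sd = 0.3),
  tilt   = -0.2 * temp + rnorm(n, sd = 0.3)
)
# Sensors offline: every measurement in these rows is missing
toy[c(10:12, 80, 150:151), c("temp", "strain", "tilt")] <- NA
# A couple of isolated single-value gaps
toy$strain[c(30, 120)] <- NA

# Step 1: fill one column from its own history
toy$temp <- na_interpolation(toy$temp)

# Step 2: no row is completely missing any more, so Amelia can use temp
imp <- amelia(toy, m = 5, idvars = "id", p2s = 0)
completed <- imp$imputations[[1]]   # first of the m completed datasets
```

Which column to pre-fill is a judgment call. A smooth, slowly changing variable such as a temperature is a reasonable candidate because interpolation distorts it least.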
