Rolling regression on irregular time series

Summary (tldr)

I need to perform a rolling regression on an irregular time series (i.e. the interval may not even be periodic and can go from 0, 1, 2, 3... to ...7, 20, 24, 28...). The time variable is a plain number and does not need to be a date/time, but the rolling window needs to be defined by time. So if I have a time series that is irregularly sampled over 600 seconds and the window is 30, the regression is performed over every 30 seconds, not every 30 samples.

I've read examples, and while I could replicate rolling sums and medians by time, I can't seem to figure it out for regression.

The problem

First of all, I have read some of the other questions about performing rolling functions on irregular time series data, such as "optimized rolling functions on irregular time series with time-based window" and "Rolling window over irregular time series".

The issue is that the examples provided so far are simple for functions like sum or median, but I have not yet figured out how to perform a simple rolling regression, i.e. using lm, subject to the same constraint that the window is based on an irregular time series. Also, my time series is much, much simpler; no date is necessary, it's simply time "elapsed".

Anyway, getting this right is important to me because irregular time (for example, a skip in the time interval) may lead to an over- or underestimate of the coefficients in the rolling regression, as the sample window will include additional time.

So I was wondering if anyone can help me create a function that does this in the simplest way? The dataset is based on measuring a variable over time, i.e. 2 variables: time and response. Time is measured every x elapsed time units (seconds, minutes, so not date/time formatted), but once in a while it becomes irregular.

For every row in the data, it should perform a linear regression based on a width of n time units. The width should never exceed n units, but may be floored (i.e. reduced) to accommodate irregular time sampling. So for example, if the width is specified as 20 seconds, but time is sampled every 6 seconds, then the window will be rounded down to 18 seconds, not up to 24 seconds.

I have looked at the question "How to calculate the average slope within a moving window in R", and I tested that code on an irregular time series, but it looks like it assumes a regular time series.

Sample data:
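
The flooring rule described above can be illustrated with a minimal sketch. It assumes a half-open window (t - n, t] ending at the current row; that convention and the variable names are illustrative assumptions, not part of the original question.

```r
# Hypothetical example: time sampled every 6 units, requested width 20
time  <- seq(0, 60, by = 6)
width <- 20
i     <- 6                      # current row, time[i] is 30

# Rows whose time lies in the half-open interval (time[i] - width, time[i]]
in_window <- time > time[i] - width & time <= time[i]

window_times <- time[in_window] # 12 18 24 30
span <- diff(range(window_times))
span                            # 18: the window is floored, never widened to 24
```

Selecting rows by a time condition rather than by a fixed row count is what keeps the window from ever exceeding the requested width.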

sample <- 
structure(list(x = c(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 
13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 
29, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 47, 48, 
49), y = c(50, 49, 48, 47, 46, 47, 46, 45, 44, 43, 44, 43, 42, 
41, 40, 41, 40, 39, 38, 37, 38, 37, 36, 35, 34, 35, 34, 33, 32, 
31, 30, 29, 28, 29, 28, 27, 26, 25, 26, 25, 24, 23, 22, 21, 20, 
19)), .Names = c("x", "y"), class = c("tbl_df", "tbl", "data.frame"
), row.names = c(NA, -46L))

My current code (based on a previous question I referred to). I know it's not subsetting by time:

library(zoo)
clm <- function(z) coef(lm(y ~ x, as.data.frame(z)))
rollme <- rollapplyr(zoo(sample), 10, clm, by.column = F, fill = NA)

The expected output (manually calculated) is below. It differs from a regular rolling regression: the numbers are different as soon as the time interval skips at 29 (secs):

    NA
    NA
    NA
    NA
    NA
    NA
    NA
    NA
    NA
    -0.696969697
    -0.6
    -0.551515152
    -0.551515152
    -0.6
    -0.696969697
    -0.6
    -0.551515152
    -0.551515152
    -0.6
    -0.696969697
    -0.6
    -0.551515152
    -0.551515152
    -0.6
    -0.696969697
    -0.6
    -0.551515152
    -0.551515152
    -0.6
    -0.696969697
    -0.605042017
    -0.638888889
    -0.716981132
    -0.597560976
    -0.528301887
    -0.5
    -0.521008403
    -0.642857143
    -0.566666667
    -0.551515152
    -0.551515152
    -0.6
    -0.696969697
    -0.605042017
    -0.638888889
    -0.716981132

I hope I'm providing enough information, but let me know if not (or point me to a good example somewhere) so I can try this.

Other things I have tried: I've tried converting the time to POSIXct format, but I don't know how to perform lm on that:

require(lubridate)    
x <- as.POSIXct(strptime(sample$x, format = "%S"))

Update: Added tldr section.

Try this:

# time interval is 1
sz=10
pl2=list()
for ( i in 1:nrow(sample)){
  if (i<sz) period=sz else
  period=length(sample$x[sample$x>(sample$x[i]-sz) & sample$x<=sample$x[i]])-1
  pl2[[i]]=seq(-period,0)
}

#update for time interval > 1
sz=10
tint=1
pl2=list()
for ( i in 1:nrow(sample)){
  if (i<sz) period=sz else
  period=length(sample$x[sample$x>(sample$x[i]-sz*tint) & sample$x<=sample$x[i]])-1
  pl2[[i]]=seq(-period,0)
}

rollme3 <- rollapplyr(zoo(sample), pl2, clm, by.column = F, fill = NA)

> tail(rollme3)
   (Intercept)          x
41    47.38182 -0.5515152
42    49.20000 -0.6000000
43    53.03030 -0.6969697
44    49.26050 -0.6050420
45    50.72222 -0.6388889
46    54.22642 -0.7169811
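
The variable-width windows built in the loop above can also be computed directly from the time values without zoo. The following is a minimal base R sketch, not the answerer's method: the function name roll_lm_by_time and the half-open window (t - width, t] are assumptions chosen to match the right-aligned results shown. Unlike the question's expected output, it also returns estimates for early rows as soon as two points are available.

```r
# Sample data from the question, rebuilt compactly
dat <- data.frame(
  x = c(0:29, 32:44, 47:49),
  y = c(50, 49, 48, 47, 46, 47, 46, 45, 44, 43, 44, 43, 42, 41, 40, 41,
        40, 39, 38, 37, 38, 37, 36, 35, 34, 35, 34, 33, 32, 31, 30, 29,
        28, 29, 28, 27, 26, 25, 26, 25, 24, 23, 22, 21, 20, 19)
)

# Hypothetical helper: right-aligned rolling lm() over a time-based window
roll_lm_by_time <- function(time, response, width) {
  t(sapply(seq_along(time), function(i) {
    # keep observations with time in (time[i] - width, time[i]]
    in_window <- time > time[i] - width & time <= time[i]
    # a slope needs at least two observations
    if (sum(in_window) < 2) return(c(intercept = NA_real_, slope = NA_real_))
    fit <- lm(response[in_window] ~ time[in_window])
    setNames(coef(fit), c("intercept", "slope"))
  }))
}

fit <- roll_lm_by_time(dat$x, dat$y, width = 10)
tail(fit)
```

Because the window is selected by a condition on x, the gap between 29 and 32 simply leaves fewer points in the affected windows instead of stretching them over more time.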

For the sake of completeness, here is an answer which uses data.table to aggregate in a non-equi join.

Although there are many similar questions, e.g. "r calculating rolling average with window based on value (not number of rows or date/time variable)", this question deserves an answer of its own as the OP is looking for the coefficients of a rolling regression.

library(data.table)
ws <- 10   # size of sliding window in time units
setDT(sample)[.(start = x - ws, end = x), on = .(x > start, x <= end),
              as.list(coef(lm(y ~ x.x))), by = .EACHI]
      x  x (Intercept)        x.x
 1: -10  0    50.00000         NA
 2:  -9  1    50.00000 -1.0000000
 3:  -8  2    50.00000 -1.0000000
 4:  -7  3    50.00000 -1.0000000
 5:  -6  4    50.00000 -1.0000000
 6:  -5  5    49.61905 -0.7142857
 7:  -4  6    49.50000 -0.6428571
 8:  -3  7    49.50000 -0.6428571
 9:  -2  8    49.55556 -0.6666667
10:  -1  9    49.63636 -0.6969697
11:   0 10    49.20000 -0.6000000
12:   1 11    48.88485 -0.5515152
13:   2 12    48.83636 -0.5515152
14:   3 13    49.20000 -0.6000000
15:   4 14    50.12121 -0.6969697
16:   5 15    49.20000 -0.6000000
17:   6 16    48.64242 -0.5515152
18:   7 17    48.59394 -0.5515152
19:   8 18    49.20000 -0.6000000
20:   9 19    50.60606 -0.6969697
21:  10 20    49.20000 -0.6000000
22:  11 21    48.40000 -0.5515152
23:  12 22    48.35152 -0.5515152
24:  13 23    49.20000 -0.6000000
25:  14 24    51.09091 -0.6969697
26:  15 25    49.20000 -0.6000000
27:  16 26    48.15758 -0.5515152
28:  17 27    48.10909 -0.5515152
29:  18 28    49.20000 -0.6000000
30:  19 29    51.57576 -0.6969697
31:  22 32    49.18487 -0.6050420
32:  23 33    50.13889 -0.6388889
33:  24 34    52.47170 -0.7169811
34:  25 35    48.97561 -0.5975610
35:  26 36    46.77358 -0.5283019
36:  27 37    45.75000 -0.5000000
37:  28 38    46.34454 -0.5210084
38:  29 39    50.57143 -0.6428571
39:  30 40    47.95556 -0.5666667
40:  31 41    47.43030 -0.5515152
41:  32 42    47.38182 -0.5515152
42:  33 43    49.20000 -0.6000000
43:  34 44    53.03030 -0.6969697
44:  37 47    49.26050 -0.6050420
45:  38 48    50.72222 -0.6388889
46:  39 49    54.22642 -0.7169811
      x  x (Intercept)        x.x

Please note that rows 10 to 30, where the time series is regularly spaced, are identical to OP's rollme.

The call to as.list() forces the result of coef(lm(...)) to appear in separate columns.
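
To see why as.list() matters here, a small base R sketch (with made-up x and y values) shows what it does to the coefficient vector. coef() returns a named numeric vector, which data.table would stack into rows of a single column within each group; a named list is instead spread into one column per element.

```r
# Hypothetical toy data lying exactly on the line y = 2x + 1
x <- 1:5
y <- 2 * x + 1

cf <- coef(lm(y ~ x))   # named numeric vector of length 2
cols <- as.list(cf)     # named list with one element per coefficient

names(cols)             # "(Intercept)" "x"
```

The list names become the column names, which is where the "(Intercept)" and "x.x" headers in the output above come from.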


The code above uses a right-aligned rolling window. However, the code can easily be adapted to support a left-aligned window as well:

# left aligned window
setDT(sample)[.(start = x, end = x + ws), on = .(x >= start, x < end),
              as.list(coef(lm(y ~ x.x))), by = .EACHI]
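
The left-aligned join condition (x >= start, x < end) corresponds to the half-open window [t, t + ws) starting at each row. As a rough illustration, here is a base R sketch with a hypothetical helper left_window_slope and made-up data that contains a gap; it is not the data.table code itself, only the same windowing rule written out.

```r
# Hypothetical helper: slope of lm() over the window [time[i], time[i] + width)
left_window_slope <- function(time, response, width) {
  vapply(seq_along(time), function(i) {
    in_window <- time >= time[i] & time < time[i] + width
    # fewer than two points in the window: no slope can be estimated
    if (sum(in_window) < 2) return(NA_real_)
    unname(coef(lm(response[in_window] ~ time[in_window]))[2])
  }, numeric(1))
}

# Made-up series with a gap between 3 and 6; points lie on y = 10 - 2 * time
time     <- c(0, 1, 2, 3, 6, 7, 8)
response <- 10 - 2 * time

slopes <- left_window_slope(time, response, width = 3)
slopes   # NA where the window holds a single point (time 3 and time 8)
```

With a left-aligned window the incomplete windows appear at the end of the series and just before gaps, rather than at the start.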

With runner one can apply any R function on an irregular time series. The user has to pass the data to the x argument and a vector of dates to the idx argument (to make the windows time-dependent). The window width k can be an integer (k = 30) or a character string as in seq.POSIXt (k = "30 secs").

  1. The first example shows how to obtain both parameters from the lm function; the output will be a matrix:
library(runner)

runner(
  x = sample,
  k = "30 secs",
  idx = sample$datetime,
  function(x) {
    coefficients(lm(y ~ x, data = x))
  }
)
  2. Or one can execute runner separately for each parameter:
library(runner)

sample$intercept <- runner(
  sample,
  k = "30 secs",
  idx = sample$datetime,
  function(x) {
    coefficients(lm(y ~ x, data = x))[1]
  }
)

sample$slope <- runner(
  sample,
  k = "30 secs",
  idx = sample$datetime,
  function(x) {
    coefficients(lm(y ~ x, data = x))[2]
  }
)
head(sample, 15)

#               datetime  x  y intercept      slope
# 1  2020-04-13 09:27:20  0 50  50.00000         NA
# 2  2020-04-13 09:27:21  1 49  50.00000 -1.0000000
# 3  2020-04-13 09:27:25  2 48  50.00000 -1.0000000
# 4  2020-04-13 09:27:29  3 47  50.00000 -1.0000000
# 5  2020-04-13 09:27:29  4 46  50.00000 -1.0000000
# 6  2020-04-13 09:27:32  5 47  49.61905 -0.7142857
# 7  2020-04-13 09:27:34  6 46  49.50000 -0.6428571
# 8  2020-04-13 09:27:38  7 45  49.50000 -0.6428571
# 9  2020-04-13 09:27:38  8 44  49.55556 -0.6666667
# 10 2020-04-13 09:27:41  9 43  49.63636 -0.6969697
# 11 2020-04-13 09:27:44 10 44  49.45455 -0.6363636
# 12 2020-04-13 09:27:47 11 43  49.38462 -0.6153846
# 13 2020-04-13 09:27:48 12 42  49.38462 -0.6153846
# 14 2020-04-13 09:27:49 13 41  49.42857 -0.6263736
# 15 2020-04-13 09:27:50 14 40  49.34066 -0.6263736

Data with datetime column

sample <- structure(
  list(
    # cumulative offsets so that datetime is non-decreasing, as in the output above
    datetime = cumsum(c(3, 1, 4, 4, 0, 3, 2, 4, 0, 3, 3, 3, 1, 1, 1, 3, 0, 2, 4, 2, 2, 
                        3, 0, 1, 2, 4, 0, 1, 4, 4, 1, 2, 1, 3, 0, 4, 4, 1, 3, 0, 0, 2, 
                        1, 0, 2, 0)) + Sys.time(),
    x = c(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 
          20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 32, 33, 34, 35, 36, 37, 38, 
          39, 40, 41, 42, 43, 44, 47, 48, 49), 
    y = c(50, 49, 48, 47, 46, 47, 46, 45, 44, 43, 44, 43, 42, 41, 40, 41, 40, 39,
          38, 37, 38, 37, 36, 35, 34, 35, 34, 33, 32, 31, 30, 29, 28, 29, 28, 27, 
          26, 25, 26, 25, 24, 23, 22, 21, 20, 19)
  ), 
  .Names = c("datetime", "x", "y"), 
  class = c("tbl_df", "tbl", "data.frame"), 
  row.names = c(NA, -46L)
)
