Rolling regression by group in the tidyverse?

Question

There are many questions about rolling regression in R, but here I am specifically looking for something that uses dplyr, broom and (if needed) purrr.

This is what makes this question different. I want to be tidyverse consistent. Is it possible to do a proper running regression with tidy tools such as purrr::map and dplyr?

Please consider this simple example:

library(dplyr)
library(purrr)
library(broom)
library(zoo)
library(lubridate)

mydata = data_frame('group' = c('a','a', 'a','a','b', 'b', 'b', 'b'),
                     'y' = c(1,2,3,4,2,3,4,5),
                     'x' = c(2,4,6,8,6,9,12,15),
                     'date' = c(ymd('2016-06-01', '2016-06-02', '2016-06-03', '2016-06-04',
                                    '2016-06-03', '2016-06-04', '2016-06-05','2016-06-06')))

  group     y     x date      
  <chr> <dbl> <dbl> <date>    
1 a      1.00  2.00 2016-06-01
2 a      2.00  4.00 2016-06-02
3 a      3.00  6.00 2016-06-03
4 a      4.00  8.00 2016-06-04
5 b      2.00  6.00 2016-06-03
6 b      3.00  9.00 2016-06-04
7 b      4.00 12.0  2016-06-05
8 b      5.00 15.0  2016-06-06

For each group (in this example, a or b):

compute the rolling regression of y on x over the last 2 observations.
store the coefficient of that rolling regression in a column of the dataframe.

Of course, as you can see, the rolling regression can only be computed for the last 2 rows in each group.

I have tried to use the following, but without success.

data %>% group_by(group) %>% 
  mutate(rolling_coef = do(tidy(rollapply(. ,
                    width=2, 
                    FUN = function(df) {t = lm(formula=y ~ x, 
                                              data = as.data.frame(df), 
                                              na.rm=TRUE); 
                    return(t$coef) },
                    by.column=FALSE, align="right"))))
Error in mutate_impl(.data, dots) : 
  Evaluation error: subscript out of bounds.
In addition: There were 21 warnings (use warnings() to see them)

Any ideas?

Expected output for the last two rows of the first group (a) is 0.5 and 0.5 (there is indeed a perfect linear correlation between y and x in this example).

More specifically:

mydata_1 <- mydata %>% filter(group == 'a',
                  row_number() %in% c(1,2))
# A tibble: 2 x 3
  group     y     x
  <chr> <dbl> <dbl>
1 a      1.00  2.00
2 a      2.00  4.00
> tidy(lm(y ~ x, mydata_1))['estimate'][2,]
[1] 0.5

and also

mydata_2 <- mydata %>% filter(group == 'a',
                              row_number() %in% c(2,3)) 
# A tibble: 2 x 3
  group     y     x
  <chr> <dbl> <dbl>
1 a      2.00  4.00
2 a      3.00  6.00
> tidy(lm(y ~ x, mydata_2))['estimate'][2,]
[1] 0.5

EDIT:

There is an interesting follow-up to this question here: rolling regression with confidence interval (tidyverse)

Answer 1

Define a function Coef whose argument is formed from cbind(y, x) and which regresses y on x with an intercept, returning the coefficients. Then apply rollapplyr using the current and prior rows over each group. If by "last" you meant the 2 rows prior to the current row, i.e. excluding the current row, then replace 2 with list(-seq(2)) as the width argument to rollapplyr.

Coef <- . %>% as.data.frame %>% lm %>% coef

mydata %>% 
  group_by(group) %>% 
  do(cbind(reg_col = select(., y, x) %>% rollapplyr(2, Coef, by.column = FALSE, fill = NA),
           date_col = select(., date))) %>%
  ungroup

giving:

# A tibble: 8 x 4
  group `reg_col.(Intercept)` reg_col.x date      
  <chr>                 <dbl>     <dbl> <date>    
1 a      NA                      NA     2016-06-01
2 a       0                       0.500 2016-06-02
3 a       0                       0.500 2016-06-03
4 a       0                       0.500 2016-06-04
5 b      NA                      NA     2016-06-03
6 b       0.00000000000000126     0.333 2016-06-04
7 b     - 0.00000000000000251     0.333 2016-06-05
8 b       0                       0.333 2016-06-06
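
The offset form of the width mentioned above (regressing over the 2 rows before the current one) is not shown in the answer. A minimal sketch of it follows, assuming the question's y and x columns; the helper name window_slope and the column name slope_prior are made up for illustration, and only the slope is kept so the result fits in a single column.

```r
library(dplyr)
library(zoo)

# Example data from the question, rebuilt without the date column so the
# sketch is self-contained.
mydata <- tibble(
  group = rep(c("a", "b"), each = 4),
  y = c(1, 2, 3, 4, 2, 3, 4, 5),
  x = c(2, 4, 6, 8, 6, 9, 12, 15)
)

# Hypothetical helper: slope of y on x for one window,
# where column 1 of the window is y and column 2 is x.
window_slope <- function(m) cov(m[, 2], m[, 1]) / var(m[, 2])

lagged <- mydata %>%
  group_by(group) %>%
  # list(-seq(2)) means offsets -1 and -2: the two rows before the current one
  mutate(slope_prior = rollapplyr(cbind(y, x), list(-seq(2)), window_slope,
                                  by.column = FALSE, fill = NA)) %>%
  ungroup()

lagged
```

With this width the first two rows of each group have no complete window, so they are filled with NA.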

Variation

A variation of the above would be:

mydata %>% 
       group_by(group) %>% 
       do(select(., date, y, x) %>% 
          read.zoo %>% 
          rollapplyr(2, Coef, by.column = FALSE, fill = NA) %>%
          fortify.zoo(names = "date")
       ) %>% 
       ungroup

Slope Only

If only the slope is needed there are further simplifications possible. We use the fact that the slope equals cov(x, y) / var(x).

slope <- . %>% { cov(.[, 2], .[, 1]) / var(.[, 2])}
mydata %>%
       group_by(group) %>%
       mutate(slope = rollapplyr(cbind(y, x), 2, slope, by.column = FALSE, fill = NA)) %>%
       ungroup
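
The identity used here can be checked directly against lm. The sketch below uses made-up numbers (not from the question) so that the fit is not perfect:

```r
# Hypothetical data, chosen so the points do not lie exactly on a line.
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1)

# Slope from an ordinary least-squares fit with an intercept.
slope_lm <- unname(coef(lm(y ~ x))["x"])

# Slope from the covariance/variance identity.
slope_cov <- cov(x, y) / var(x)

# The two agree up to floating-point tolerance.
all.equal(slope_lm, slope_cov)
```

The identity holds only for a simple regression of y on a single x with an intercept, which is the case in this question.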

Answer 2

Does this do what you're after?

# `data` is the example data frame from the question (there called mydata), here without the date column
data %>% 
  group_by(group) %>% 
  do(data.frame(., rolling_coef = c(NA, rollapply(data = ., width = 2, FUN = function(df_) {
    d = data.frame(df_)
    d[, 2:3] <- apply(d[,2:3], MARGIN = 2, FUN = as.numeric)
    mod = lm(y ~ x, data = d)
    return(coef(mod)[2])
  }, by.column = FALSE, align = "right"))))

Giving:

# A tibble: 8 x 4
# Groups:   group [2]
  group     y     x rolling_coef
  <chr> <dbl> <dbl>        <dbl>
1 a        1.    2.       NA    
2 a        2.    4.        0.500
3 a        3.    6.        0.500
4 a        4.    8.        0.500
5 b        2.    6.       NA    
6 b        3.    9.        0.333
7 b        4.   12.        0.333
8 b        5.   15.        0.333

Edit: Slightly modified code, but data_frame will not accept the . group placeholder as an argument; not sure how to fix that.

data %>% 
  group_by(group) %>% 
  do(data.frame(., rolling_coef = c(NA, rollapplyr(data = select(., y, x), width = 2, FUN = function(df_) {
    # fit on the current window (df_), not on the whole group (.)
    mod = lm(y ~ x, data = as.data.frame(df_))
    return(coef(mod)[2])
  }, by.column = FALSE))))

Edit 2: Using fill = NA rather than c(NA, ...) achieves the same result.

data %>% 
  group_by(group) %>% 
  do(data.frame(., rolling_coef = rollapplyr(data = select(., y, x), width = 2, FUN = function(df_) {
    # fit on the current window (df_), not on the whole group (.)
    mod = lm(y ~ x, data = as.data.frame(df_))
    return(coef(mod)[2])
  }, by.column = FALSE, fill = NA)))

Answer 3

Here is a solution similar to G. Grothendieck's answer but using the rollRegres package. I have to increase the width argument to 3 to avoid an error (by the way, why do you want a regression with so few observations?)

library(rollRegres)
Coef <- . %>% { roll_regres.fit(x = cbind(1, .$x), y = .$y, width = 2L)$coefs }

mydata %>%
  group_by(group) %>%
  do(cbind(reg_col = select(., y, x) %>% Coef,
           date_col = select(., date))) %>%
  ungroup
#R  Error in mydata %>% group_by(group) %>% do(cbind(reg_col = select(., y,  :
#R    Assertion on 'width' failed: All elements must be >= 3.

# change width to avoid error
Coef <- . %>% { roll_regres.fit(x = cbind(1, .$x), y = .$y, width = 3L)$coefs }
mydata %>%
  group_by(group) %>%
  do(cbind(reg_col = select(., y, x) %>% Coef,
           date_col = select(., date))) %>%
  ungroup
#R # A tibble: 8 x 4
#R   group reg_col.1 reg_col.2 date
#R   <chr>     <dbl>     <dbl> <date>
#R 1 a      NA           NA     2016-06-01
#R 2 a      NA           NA     2016-06-02
#R 3 a       1.54e-15     0.500 2016-06-03
#R 4 a      -5.13e-15     0.5   2016-06-04
#R 5 b      NA           NA     2016-06-03
#R 6 b      NA           NA     2016-06-04
#R 7 b      -3.08e-15     0.333 2016-06-05
#R 8 b      -4.62e-15     0.333 2016-06-06
#R Warning messages:
#R 1: In evalq((function (..., call. = TRUE, immediate. = FALSE, noBreaks. = FALSE,  :
#R    low sample size relative to number of parameters
#R 2: In evalq((function (..., call. = TRUE, immediate. = FALSE, noBreaks. = FALSE,  :
#R    low sample size relative to number of parameters

Answer 4

This is more of an idea than an answer, but maybe instead of using group_by try using map and your list of groups:

library(tidyr)  # spread() comes from tidyr, which the question does not load

FUN <- function(g, df = NULL) {
  tmp <- tidy(rollapply(
    zoo(filter(df, group == g)),
    width = 2,
    FUN = function(z) {
      t <- lm(y ~ x, data = as.data.frame(z)) ; return(t$coef)
    },
    by.column = FALSE,
    align = "right"
    ))
  tmp$series <- c(rep('intercept', nrow(tmp) / 2), rep('slope', nrow(tmp) / 2))
  spread(tmp, series, value) %>% mutate(group = g)
}

map_dfr(list('a', 'b'), FUN, df = data)
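
The same idea can also be sketched with purrr alone, without zoo. This is not the answerer's code: the helper name roll_slope is made up for illustration, and group_modify is assumed to be available (dplyr 0.8.0 or later). It fits lm on each window of the current and previous row and keeps only the slope.

```r
library(dplyr)
library(purrr)

# Example data from the question, rebuilt without the date column so the
# sketch is self-contained.
mydata <- tibble(
  group = rep(c("a", "b"), each = 4),
  y = c(1, 2, 3, 4, 2, 3, 4, 5),
  x = c(2, 4, 6, 8, 6, 9, 12, 15)
)

# Hypothetical helper: slope of y ~ x over the current row and the
# (width - 1) rows before it; NA where the window is incomplete.
roll_slope <- function(df, width = 2) {
  map_dbl(seq_len(nrow(df)), function(i) {
    if (i < width) return(NA_real_)
    window <- df[(i - width + 1):i, ]
    coef(lm(y ~ x, data = window))[["x"]]
  })
}

result <- mydata %>%
  group_by(group) %>%
  # .x is the data for one group, without the grouping column
  group_modify(~ mutate(.x, rolling_coef = roll_slope(.x))) %>%
  ungroup()

result
```

Refitting lm for every window is simple but slow on long series; the rollapplyr and rollRegres approaches in the other answers are likely better suited to large data.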
