
Rolling Regression by Group

Hi, I have a panel data set. I'd like to run a rolling-window regression for each firm and extract the coefficient of the independent variable. y is the dependent variable and x is the independent variable. The rolling window is 12: the first regression uses rows 1 to 12, the second uses rows 2 to 13, and so on. I am using rollapply.
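To make the window definition concrete, here is a minimal base R sketch for a single firm. The names x_demo, y_demo and beta_demo are hypothetical and the data are simulated; the explicit loop is only meant to show which rows enter each regression, not to be fast.

```r
# Hypothetical toy series for one firm (simulated, illustration only)
set.seed(1)
n <- 24
width <- 12
x_demo <- rnorm(n, 5, 2)
y_demo <- rnorm(n, 10, 5)

# Slope of y on x for each right-aligned window of 12 rows.
# Positions 1..11 stay NA because no full window ends there.
beta_demo <- rep(NA_real_, n)
for (end in width:n) {
  idx <- (end - width + 1):end   # rows 1..12, then 2..13, and so on
  beta_demo[end] <- coef(lm(y_demo[idx] ~ x_demo[idx]))[2]
}
beta_demo
```

With 24 rows and a window of 12, this gives 13 slopes, the first one at row 12.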

Here is a question with exactly the same error I encountered: "Rolling by group in data.table R". That question only needs one column, but my regression needs two, so I can't adapt the recommended answer from that post. Another post, "rolling regression with dplyr", uses a for loop; my real data has more than 2 million observations, so that is too slow. Can anyone help?

My fake data set is as follows:

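library(dplyr)      # group_by, mutate, %>%
library(data.table) # as.data.table, :=
library(zoo)        # rollapply
dt<-rep(c("AAA","BBB","CCC"),each=24)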
dt<-as.data.frame(dt)
names(dt)[names(dt)=="dt"] <- "firm"
a<-c(20100131,20100228,20100331,20100430,20100531,20100630,20100731,20100831,20100930,20101031,20101130,20101231,20110131,20110228,20110331,20110430,20110531,20110630,20110731,20110831,20110930,20111031,20111130,20111231)
dt$time<-rep(a,3)
dt<-dt%>% group_by(firm)%>%
  mutate(y=rnorm(24,10,5))
dt<-dt%>% group_by(firm)%>%
  mutate(x=rnorm(24,5,2))
dt<-as.data.table(dt)

I tried this code:

# create rolling regression function
roll <- function(Z)
{
  t = lm(formula=y~x, data = as.data.frame(Z), na.rm=T);
  return(t$coef[2])
}
dt[,beta := rollapply(dt, width=12, roll, fill=NA, by.column=FALSE, align="right") , by=firm]

I am trying to create a column called "beta" that holds the coefficient of x. For each firm, the first value should appear at the 12th observation.

It looks like the regression takes x and y from the 1st row for every group, and the coefficients seem a bit off compared to the results I got from Excel.

The second method I tried is the dplyr version:

dt %>%
group_by(firm) %>%
mutate(dt,beta = rollapply(dt,12,function(x) coef(lm(y~x,data=as.data.frame(x)))[2],by.column= FALSE, fill = NA, align = "right"))

It gives me the same issue: each group has the same numbers. It looks like, for each firm, the regression takes y and x from the 1st row.
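One possible explanation, consistent with the use of .SD in the answer further down, is that the table's own name inside a grouped call still refers to the full table, while .SD holds only the current group's rows. The sketch below uses a hypothetical table called demo and simply counts rows both ways; it is an illustration under that assumption, not a diagnosis of the exact output above.

```r
library(data.table)

# Hypothetical table with the same shape as the fake data (3 firms, 24 rows each)
set.seed(1)
demo <- data.table(
  firm = rep(c("AAA", "BBB", "CCC"), each = 24),
  y = rnorm(72, 10, 5),
  x = rnorm(72, 5, 2)
)

# Inside `by = firm`, `demo` is still the whole table; `.SD` is the group subset
check <- demo[, .(rows_in_demo = nrow(demo), rows_in_SD = nrow(.SD)), by = firm]
check
```

Each firm should report 72 rows for demo and 24 rows for .SD, which is why a function applied to the table name sees the same data in every group.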

Any thoughts? Thank you so much.

Here is a solution that uses the rollRegres and data.table packages. I have also added a modified version of the OP's solution that works (see eddi's comment), and used an example with 2 million observations, as the OP mentions.

#####
# setup data
library(rollRegres)
library(data.table)
library(dplyr)

set.seed(33700919)
n_firms <- 83334 # yields ~2M rows, as the OP mentions
dt <- rep(1:n_firms, each = 24)
dt <- data.frame(firm = dt)
a <-c(20100131,20100228,20100331,20100430,20100531,20100630,20100731,20100831,20100930,20101031,20101130,20101231,20110131,20110228,20110331,20110430,20110531,20110630,20110731,20110831,20110930,20111031,20111130,20111231)
dt$time <- rep(a, n_firms)
dt <- dt %>% group_by(firm) %>% mutate(y=rnorm(24,10,5))
dt <- dt %>% group_by(firm) %>% mutate(x=rnorm(24,5,2))
dt <- as.data.table(dt)
nrow(dt) # roughly the 2M rows that the OP mentions
#R [1] 2000016

#####
# fit models
setkey(dt, firm, time) # make sure data is sorted correctly
start_time <- Sys.time() # to show computation time
dt[
  , beta :=
    roll_regres.fit(x = cbind(1, .SD[["x"]]), y = .SD[["y"]],
                    width = 12L)$coefs[, 2],
  by = firm]
Sys.time() - start_time
#R Time difference of 6.526595 secs

# gives the same as OP's solution with minor corrections
library(zoo)
start_time <- Sys.time()
roll <- function(Z)
  lm.fit(x = cbind(1, Z[, "x"]), y = Z[, "y"])$coef[2]
dt[
  , beta_zoo :=
    rollapply(.SD, width=12, roll, fill=NA, by.column=FALSE, align="right"),
  by=firm]
Sys.time() - start_time # much slower
#R Time difference of 1.87341 mins

# gives the same
all.equal(dt$beta, dt$beta_zoo)
#R [1] TRUE
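
As an aside that is not part of the answer above: for a single regressor, the rolling slope can also be computed in base R from rolling sums, using slope = (n Σxy − Σx Σy) / (n Σx² − (Σx)²). The function name roll_slope_cumsum is hypothetical, and this is a sketch of a different technique (cumulative sums), not what rollRegres does internally. Cumulative sums can lose precision on long or badly scaled series, so treat it as a rough cross-check.

```r
# Rolling OLS slope of y on x over right-aligned windows, via cumulative sums.
# Hypothetical helper; assumes no missing values in x or y.
roll_slope_cumsum <- function(x, y, width) {
  n <- length(x)
  out <- rep(NA_real_, n)
  if (n < width) return(out)

  # Sum of v over each window ending at positions width..n
  roll_sum <- function(v) {
    cs <- c(0, cumsum(v))
    cs[(width + 1):(n + 1)] - cs[1:(n - width + 1)]
  }

  sx  <- roll_sum(x)
  sy  <- roll_sum(y)
  sxy <- roll_sum(x * y)
  sxx <- roll_sum(x * x)

  out[width:n] <- (width * sxy - sx * sy) / (width * sxx - sx^2)
  out
}

# Usage on simulated data for one firm
set.seed(2)
x <- rnorm(24, 5, 2)
y <- rnorm(24, 10, 5)
b <- roll_slope_cumsum(x, y, 12)
```

In a grouped data.table call this could be applied per firm with something like dt[, beta_cs := roll_slope_cumsum(x, y, 12L), by = firm], where beta_cs is again an illustrative column name.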

Maybe you can try changing the first argument of rollapply: replace dt with only the needed columns, dt[, c("y","x")]. See if it works.
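
A sketch of how this suggestion might look inside a grouped call, assuming .SDcols is used so that .SD contains only y and x for the current firm (dt[, c("y","x")] on its own would still select rows from every firm). The table demo and the helper window_slope are hypothetical names, and the data are simulated.

```r
library(data.table)
library(zoo)

set.seed(1)
demo <- data.table(
  firm = rep(c("AAA", "BBB", "CCC"), each = 24),
  y = rnorm(72, 10, 5),
  x = rnorm(72, 5, 2)
)

# Slope of y on x for one window; Z is a numeric matrix with columns "y" and "x"
window_slope <- function(Z)
  lm.fit(x = cbind(1, Z[, "x"]), y = Z[, "y"])$coefficients[2]

# .SDcols restricts .SD to the two regression columns within each firm
demo[
  , beta := rollapply(as.matrix(.SD), width = 12, FUN = window_slope,
                      by.column = FALSE, fill = NA, align = "right"),
  by = firm, .SDcols = c("y", "x")]
```

Converting .SD to a matrix keeps the columns numeric, so the character firm column never reaches the regression.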
