
Rolling regression with unbalanced panel

I am trying to compute a measure of environmental complexity based on a rolling regression. Complexity is defined as follows: for a rolling three-year window, and for all firms in the industry, sales in the end year are regressed on sales in the beginning year. The slope coefficient of that regression is the environmental complexity.

This is where I run into a problem: firms enter and leave an industry between years, so the set of firms observed in the beginning year (the Xs) differs from the set observed in the end year (the Ys).

To better illustrate this, given the following sample dataset:

> library(data.table)
> set.seed(123)
> df <- data.table("Firm" = c("A","B","A","B","C","A","B","C","D","B","C","D","E","F"),
+                  "Year" = c(1980,1980,1981,1981,1981,1982,1982,1982,1982,1983,1983,1983,1983,1983),
+                  "Industry" = rep("I1",14),
+                  "Sale" = sample(1000:1500,14))
> df
    Firm Year Industry Sale
 1:    A 1980       I1 1414
 2:    B 1980       I1 1462
 3:    A 1981       I1 1178
 4:    B 1981       I1 1013
 5:    C 1981       I1 1194
 6:    A 1982       I1 1425
 7:    B 1982       I1 1305
 8:    C 1982       I1 1117
 9:    D 1982       I1 1298
10:    B 1983       I1 1228
11:    C 1983       I1 1243
12:    D 1983       I1 1497
13:    E 1983       I1 1373
14:    F 1983       I1 1152

To measure complexity in 1983, I have to look at the 1981-1983 window. However, there are five firms in the industry in 1983 and three firms in 1981, and only two firms (B and C) are common to both years. So, to measure complexity in 1983, I first have to keep only firms B and C, and then regress their 1983 sales (1228 and 1243) on their 1981 sales (1013 and 1194), which gives the coefficient 0.08287293.
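
As a quick check of the 1983 figure, here is a minimal sketch that reproduces it by hand. It assumes the sale values are exactly those printed in the sample table (they are typed in rather than re-sampled), and the names `start`, `end`, `pair` and `slope_1983` are only illustrative.

```r
library(data.table)

# Sale values copied from the printed sample table (not re-sampled)
df <- data.table(
  Firm     = c("A","B","A","B","C","A","B","C","D","B","C","D","E","F"),
  Year     = c(1980,1980,1981,1981,1981,1982,1982,1982,1982,1983,1983,1983,1983,1983),
  Industry = "I1",
  Sale     = c(1414,1462,1178,1013,1194,1425,1305,1117,1298,1228,1243,1497,1373,1152)
)

# Keep only firms observed in both the start and the end year of the window
start <- df[Year == 1981, .(Firm, Sale_start = Sale)]
end   <- df[Year == 1983, .(Firm, Sale_end = Sale)]
pair  <- merge(start, end, by = "Firm")   # inner join: firms B and C remain

# Slope of end-year sales on start-year sales: (1243 - 1228) / (1194 - 1013)
slope_1983 <- unname(coef(lm(Sale_end ~ Sale_start, data = pair))[2])
```

With only two common firms the line passes through both points exactly, so the slope is simply 15/181.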

The desired output should look like:

> df
    Firm Year Industry Sale       compx
 1:    A 1980       I1 1414          NA
 2:    B 1980       I1 1462          NA
 3:    A 1981       I1 1178          NA
 4:    B 1981       I1 1013          NA
 5:    C 1981       I1 1194          NA
 6:    A 1982       I1 1425 -2.50000000
 7:    B 1982       I1 1305 -2.50000000
 8:    C 1982       I1 1117 -2.50000000
 9:    D 1982       I1 1298 -2.50000000
10:    B 1983       I1 1228  0.08287293
11:    C 1983       I1 1243  0.08287293
12:    D 1983       I1 1497  0.08287293
13:    E 1983       I1 1373  0.08287293
14:    F 1983       I1 1152  0.08287293

I have added a variable for industry because I want to iterate the procedure for each industry. A data.table solution would be great as my dataset is rather large.

Many thanks in advance.

Edit: My apologies, I had entered an incorrect desired output. It's edited now.

One simple solution would be to write a fit function that takes a subset of your data, reshapes it with dcast so that the first and last year are separate columns, runs the regression, and extracts the slope coefficient. Then you can loop over years and merge the results back into your main dataset.

Here are two different approaches. The first one does reshaping at the start and end. The second one reshapes inside the loop.

First:

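# Reshape to one row per firm and one column per year
df_wide <- dcast(df, Firm + Industry ~ Year, value.var="Sale")

# Columns 1-2 are Firm and Industry; the year columns start at 3.
# Work backwards so that column i-2 still holds sales when it is used.
for (i in ncol(df_wide):3) {
  if (i < 5) {
    # The first two years have no year two periods earlier
    df_wide[[i]] <- NA_real_
  } else {
    f <- paste0("`", colnames(df_wide)[i], "`~`", colnames(df_wide)[i-2], "`")
    f <- as.formula(f)
    # lm() drops firms with NA in either year, leaving only the common firms
    df_wide[[i]] <- coef(lm(f, df_wide))[2]
  }
}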

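compx_long <- melt(df_wide,
                   id.vars=c("Firm", "Industry"),
                   variable.name="Year",
                   value.name="compx",
                   variable.factor=FALSE)
compx_long[, Year := as.numeric(Year)]
# melt() returns every Firm x Year cell; join to df to keep observed rows and Sale
merge(df, compx_long, by=c("Firm", "Industry", "Year"))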

Second:

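# Slope of sales in `year` on sales two years earlier.
# Note: uses the global df and pools all industries in it.
fit <- function(year=1980) {
  year_min <- year - 2
  year_max <- year
  if (all(year_min:year_max %in% df$Year)) {
    # Keep the window and spread years into columns, one row per firm
    tmp <- df[Year %in% year_min:year_max]
    tmp <- dcast(tmp, Firm + Industry ~ Year, value.var="Sale")
    f <- as.formula(paste0("`", year_max, "`~`", year_min, "`"))
    # lm() drops firms missing in either year
    out <- coef(lm(f, tmp))[2]
  } else {
    # Not enough history for a full window
    out <- NA_real_
  }
  out <- data.table(Year=year, compx=out)
  return(out)
}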

# Fit each year and stack the one-row results
# (indexing a list by 1980:1985 would pad it with 1979 NULL entries)
results <- rbindlist(lapply(1980:1985, fit))

# merge() takes by=, not on=
merge(df, results, by="Year")
#>     Year Firm Industry Sale       compx
#>  1: 1980    A       I1 1414          NA
#>  2: 1980    B       I1 1462          NA
#>  3: 1981    A       I1 1178          NA
#>  4: 1981    B       I1 1013          NA
#>  5: 1981    C       I1 1194          NA
#>  6: 1982    A       I1 1425 -2.50000000
#>  7: 1982    B       I1 1305 -2.50000000
#>  8: 1982    C       I1 1117 -2.50000000
#>  9: 1982    D       I1 1298 -2.50000000
#> 10: 1983    B       I1 1228  0.08287293
#> 11: 1983    C       I1 1243  0.08287293
#> 12: 1983    D       I1 1497  0.08287293
#> 13: 1983    E       I1 1373  0.08287293
#> 14: 1983    F       I1 1152  0.08287293
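
Both approaches above pool every firm in df into a single regression, while the question asks for the procedure to be repeated per industry. The following is a rough sketch of one way to do that, not a tested production solution. The helper `window_slope()` and the names `keys`, `vals` and `result` are hypothetical, the sale values are typed in from the sample table, each firm is assumed to appear at most once per industry-year, and returning NA when fewer than two firms are common to both years is my own assumption.

```r
library(data.table)

# Sale values copied from the printed sample table (not re-sampled)
df <- data.table(
  Firm     = c("A","B","A","B","C","A","B","C","D","B","C","D","E","F"),
  Year     = c(1980,1980,1981,1981,1981,1982,1982,1982,1982,1983,1983,1983,1983,1983),
  Industry = "I1",
  Sale     = c(1414,1462,1178,1013,1194,1425,1305,1117,1298,1228,1243,1497,1373,1152)
)

# Hypothetical helper: slope of sales in `year` on sales in `year - lag`,
# using only firms present in both years of the single-industry table `d`.
window_slope <- function(d, year, lag = 2) {
  x <- d[Year == year - lag]
  y <- d[Year == year]
  common <- intersect(x$Firm, y$Firm)
  # Assumption: a slope needs at least two common firms
  if (length(common) < 2) return(NA_real_)
  xs <- x$Sale[match(common, x$Firm)]
  ys <- y$Sale[match(common, y$Firm)]
  unname(coef(lm(ys ~ xs))[2])
}

# One coefficient per observed industry-year
keys <- unique(df[, .(Industry, Year)])
vals <- mapply(function(ind, yr) window_slope(df[Industry == ind], yr),
               keys$Industry, keys$Year, USE.NAMES = FALSE)
keys[, compx := vals]

# Attach the industry-year coefficient to every firm-level row
result <- merge(df, keys, by = c("Industry", "Year"))
```

On the sample data this gives NA for 1980 and 1981, -2.5 for 1982 and 15/181 (about 0.0829) for 1983. For a large panel the per-key subsetting may be slow, so setting a key on Industry and Year first would be worth trying.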

