"Multi-step" regression with broom and dplyr in R

I am looking for a way to perform "multi-step" regression with broom and dplyr in R. By "multi-step" I mean regression analyses in which the final model includes elements of earlier regression models, such as their fitted values or residuals. One example is the two-stage least squares (2SLS) approach to instrumental variable (IV) regression.

My (grouped) data looks like this:

df <- data.frame(
  id = sort(rep(seq(1, 20, 1), 5)),
  group = rep(seq(1, 4, 1), 25),
  y = runif(100),
  x = runif(100),
  z1 = runif(100),
  z2 = runif(100)
)

Here, id and group are identifiers, y is the dependent variable, and x, z1 and z2 are predictors. In an IV setting, x would be an endogenous predictor.

Here is an example of a "multi-step" regression:


library(dplyr)
library(tidyr)
library(purrr)
library(broom)

# Nest the data frame
df_nested <- df %>% 
  group_by(group) %>% 
  nest()

# Run first stage regression and retrieve residuals
df_fit <- df_nested %>% 
  mutate(
    fit1 = map(data, ~ lm(x ~ z1 + z2, data = .x)),
    resids = map(fit1, residuals) 
  )

# Run second stage with residuals as control variable
# (this is the call that fails with the length error)
df_fit %>% 
  mutate(
    fit2 = map2(data, resids, ~ tidy(lm(y ~ x + z2 + .y["resids"], data = .x)))
  ) %>% 
  unnest(fit2)

This produces an error indicating that .x and .y have different lengths. How can I include the residuals (.y["resids"] in this attempt) in the second regression as a control variable?
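The length mismatch can be reproduced in isolation. Each element of resids is already the bare numeric vector returned by residuals(), whose names are the row names of the data. Indexing it with "resids" therefore looks for an element with that name and returns a single NA, which cannot be lined up with the rows of the data. A minimal base R sketch, using made-up toy data (toy, res and picked are names introduced here purely for illustration):

```r
# Hypothetical toy data, only to illustrate the indexing behaviour
set.seed(1)
toy <- data.frame(x = runif(10), z1 = runif(10))

# residuals() returns a numeric vector named after the row names "1", ..., "10"
res <- residuals(lm(x ~ z1, data = toy))

# Character indexing is a lookup by name; no element is called "resids"
picked <- res["resids"]

length(res)     # one residual per row of toy
length(picked)  # a single element
is.na(picked)   # that element is NA
```

Inside lm(), a length-one term like this cannot be matched against the full-length columns of the data, which is consistent with the reported error.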

One option is to add the residuals as a new column to each nested data frame after the first-stage regression:


library(dplyr)
library(tidyr)
library(purrr)
library(broom)

# Nest the data frame
df_nested <- df %>% 
  group_by(group) %>% 
  nest()

# Run first stage regression and retrieve residuals
df_fit <- df_nested %>% 
  mutate(
    fit1 = map(data, ~ lm(x ~ z1 + z2, data = .x)),
    resids = map(fit1, residuals),
    # Attach each group's residuals to its data as a column named resids
    data = map2(data, resids, ~ bind_cols(.x, resids = .y))
  )

# Run second stage with residuals as control variable
df_fit %>% 
  mutate(
    fit2 = map(data, ~ tidy(lm(y ~ x + z2 + resids, data = .x)))
  ) %>% 
  unnest(fit2)
#> # A tibble: 16 × 9
#> # Groups:   group [4]
#>    group data        fit1   resids  term    estimate std.error statistic p.value
#>    <dbl> <list>      <list> <list>  <chr>      <dbl>     <dbl>     <dbl>   <dbl>
#>  1     1 <tibble [2… <lm>   <dbl [… (Inter…   0.402      0.524    0.767  0.451  
#>  2     1 <tibble [2… <lm>   <dbl [… x         0.0836     0.912    0.0917 0.928  
#>  3     1 <tibble [2… <lm>   <dbl [… z2        0.161      0.250    0.644  0.527  
#>  4     1 <tibble [2… <lm>   <dbl [… resids   -0.0536     0.942   -0.0569 0.955  
#>  5     2 <tibble [2… <lm>   <dbl [… (Inter…   0.977      0.273    3.58   0.00175
#>  6     2 <tibble [2… <lm>   <dbl [… x        -0.561      0.459   -1.22   0.235  
#>  7     2 <tibble [2… <lm>   <dbl [… z2       -0.351      0.192   -1.82   0.0826 
#>  8     2 <tibble [2… <lm>   <dbl [… resids    0.721      0.507    1.42   0.170  
#>  9     3 <tibble [2… <lm>   <dbl [… (Inter…  -0.710      1.19    -0.598  0.556  
#> 10     3 <tibble [2… <lm>   <dbl [… x         3.61       3.80     0.951  0.352  
#> 11     3 <tibble [2… <lm>   <dbl [… z2       -1.21       1.19    -1.01   0.323  
#> 12     3 <tibble [2… <lm>   <dbl [… resids   -3.67       3.80    -0.964  0.346  
#> 13     4 <tibble [2… <lm>   <dbl [… (Inter…  59.6       40.1      1.49   0.152  
#> 14     4 <tibble [2… <lm>   <dbl [… x       -83.4       56.5     -1.48   0.155  
#> 15     4 <tibble [2… <lm>   <dbl [… z2      -18.7       12.8     -1.45   0.160  
#> 16     4 <tibble [2… <lm>   <dbl [… resids   83.4       56.5      1.48   0.155
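
A possible variant, sketched here on the assumption that broom's augment() is acceptable for the first stage: augment(fit, data = ...) returns the original rows plus a .resid column, so the residuals do not have to be bound by hand. The seed and the name two_step are additions for illustration and are not part of the original post; because the data are random, the estimates will differ from the output shown above.

```r
library(dplyr)
library(tidyr)
library(purrr)
library(broom)

set.seed(42)  # assumption: seed added only so the sketch is reproducible
df <- data.frame(
  id = sort(rep(seq(1, 20, 1), 5)),
  group = rep(seq(1, 4, 1), 25),
  y = runif(100),
  x = runif(100),
  z1 = runif(100),
  z2 = runif(100)
)

two_step <- df %>%
  group_by(group) %>%
  nest() %>%
  mutate(
    # First stage, fitted separately within each group
    fit1 = map(data, ~ lm(x ~ z1 + z2, data = .x)),
    # augment() adds .fitted, .resid and other diagnostics to the original rows
    data = map2(fit1, data, ~ augment(.x, data = .y)),
    # Second stage, with the first-stage residuals as a control variable
    fit2 = map(data, ~ tidy(lm(y ~ x + z2 + .resid, data = .x)))
  ) %>%
  unnest(fit2)

two_step %>% select(group, term, estimate, std.error)
```

The residual term is then reported as .resid rather than resids; the same pattern should also work with .fitted if the second stage needs the first-stage fit instead.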

