How to avoid the dummy variable trap when dummy-coding multiple treatments plus a control group for linear regression in R

Question

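I know that I need one fewer dummy variable than the total number of groups. However, I am stuck, because when I run lm in R I keep getting the message "(1 not defined because of singularities)". I found a similar question here: What is causing this error? Coefficients not defined because of singularities, but it is slightly different from my problem.

I have two treatments, (1) "benefit" and (2) "history", each with two levels: (1) "low" and "high", and (2) "short" and "long", i.e. 4 combinations. In addition, I have a control group that was exposed to neither treatment. I therefore coded 4 dummy variables (one fewer than the total number of groups, n = 5). The dummy-coded data look like this: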

                               low benefit  high benefit  short history  long history
Control group                           0             0              0             0
low benefit, short history              1             0              1             0
low benefit, long history               1             0              0             1
high benefit, short history             0             1              1             0
high benefit, long history              0             1              0             1

When I run my lm, I get this:

Model: 
summary(lm(X ~ short history + high benefit + long history + low benefit + Control variables, data = df))

Coefficients: (1 not defined because of singularities)
                                         Estimate   Std. Error  t value  Pr(>|t|)
(Intercept)                           5.505398100  0.963932438  5.71139  4.8663e-08 ***
Dummy short history                   0.939025772  0.379091565  2.47704   0.0142196 *
Dummy high benefit                   -0.759944023  0.288192645 -2.63693   0.0091367 **
Dummy long history                    0.759352915  0.389085599  1.95163   0.0526152 .
Dummy low benefit                              NA           NA       NA          NA
Control variables                          xxx          xxx        xxx       xxx

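This always happens to the dummy variable in the 4th position. The control variables are all estimated without problems.

I have already tried including only two variables with two levels each, i.e. I coded "history" as 1 for "long" and 0 for "short", and "benefit" as 1 for "high" and 0 for "low". That way lm works, but the problem is that the control group and the combination "short history, low benefit" are coded identically, namely 0 on both variables.

Sorry if this is a basic mistake, but I cannot figure it out. If you need more information, please say so. Thanks in advance.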

Answer 1

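As I said in the comments, you only have two variables; if you make them factors and check the contrasts, R will do the right thing. See also http://www.sthda.com/english/articles/40-regression-analysis/163-regression-with-categorical-variables-dummy-coding-essentials-in-r/

Make up some data to represent yours.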
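To see where the singularity in the question comes from, the four dummies from the question's table can be put into a small design matrix and its rank checked. This is only a sketch: the column names are illustrative and one row stands in for each of the five groups.

```r
# One row per group, in the order of the table in the question.
# Column names are illustrative, not taken from the asker's data.
dummies <- cbind(
  intercept     = 1,
  low_benefit   = c(0, 1, 1, 0, 0),
  high_benefit  = c(0, 0, 0, 1, 1),
  short_history = c(0, 1, 0, 1, 0),
  long_history  = c(0, 0, 1, 0, 1)
)

# Five columns, but fewer linearly independent ones.
rank_of_design <- qr(dummies)$rank

# The dependency: low + high benefit equals short + long history in every row.
same_sum <- all(
  dummies[, "low_benefit"] + dummies[, "high_benefit"] ==
    dummies[, "short_history"] + dummies[, "long_history"]
)

rank_of_design
same_sum
```

Because the two benefit dummies and the two history dummies both add up to "not control", lm cannot separate all four and reports NA for whichever one comes last in the formula.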

set.seed(2020)
df <- data.frame(
  X = runif(n = 120, min = 5, max = 15),
  benefit = rep(c("control", "low", "high"), 40),
  history = c(rep("control", 40), rep("long", 40), rep("short", 40))
)

Make benefit and history factors, and check that control is the base contrast for each factor.

df$benefit <- factor(df$benefit)
df$history <- factor(df$history)
contrasts(df$benefit)
#>         high low
#> control    0   0
#> high       1   0
#> low        0   1
contrasts(df$history)
#>         long short
#> control    0     0
#> long       1     0
#> short      0     1
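In the output above, "control" ends up as the base level only because it sorts first alphabetically. If the reference group had a different label, it could be set explicitly with relevel(). A minimal sketch with a hypothetical label "placebo":

```r
# Hypothetical labels where the reference group does not sort first.
group <- factor(c("placebo", "high", "low", "placebo", "low"))
levels(group)  # alphabetical order, so "high" would be the base level

# Make "placebo" the reference level explicitly.
group <- relevel(group, ref = "placebo")
contrasts(group)  # the "placebo" row is now all zeros
```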

Run the regression and get the summary. All 4 coefficients are relative to control/control.

lm(X ~ benefit + history, df)
#> 
#> Call:
#> lm(formula = X ~ benefit + history, data = df)
#> 
#> Coefficients:
#>  (Intercept)   benefithigh    benefitlow   historylong  historyshort  
#>      9.94474      -0.08721       0.11245       0.37021      -0.35675
summary(lm(X ~ benefit + history, df))
#> 
#> Call:
#> lm(formula = X ~ benefit + history, data = df)
#> 
#> Residuals:
#>     Min      1Q  Median      3Q     Max 
#> -5.4059 -2.3706 -0.0007  2.4986  4.7669 
#> 
#> Coefficients:
#>              Estimate Std. Error t value Pr(>|t|)    
#> (Intercept)   9.94474    0.56786  17.513   <2e-16 ***
#> benefithigh  -0.08721    0.62842  -0.139    0.890    
#> benefitlow    0.11245    0.62842   0.179    0.858    
#> historylong   0.37021    0.62842   0.589    0.557    
#> historyshort -0.35675    0.62842  -0.568    0.571    
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 2.809 on 115 degrees of freedom
#> Multiple R-squared:  0.01253,    Adjusted R-squared:  -0.02182 
#> F-statistic: 0.3648 on 4 and 115 DF,  p-value: 0.8333
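One caveat worth checking against your own data: in the made-up data above, the "control" labels of the two factors are assigned independently, so the model has full rank. If every control row is "control" on both factors, as the table in the question suggests, the additive model is still rank deficient. One possible alternative, which is not part of the original answer, is to code the five groups as a single factor with control as the reference level. The data frame d and the column group below are hypothetical.

```r
set.seed(2020)
# Hypothetical data: control rows are "control" on both treatments.
d <- data.frame(
  benefit = rep(c("control", "low", "low", "high", "high"), each = 20),
  history = rep(c("control", "short", "long", "short", "long"), each = 20)
)
d$X <- runif(nrow(d), min = 5, max = 15)

# Additive model: one coefficient is aliased and comes back as NA.
fit_additive <- lm(X ~ benefit + history, data = d)
n_na_additive <- sum(is.na(coef(fit_additive)))

# Single factor with five levels; "control_control" is the reference level.
d$group <- relevel(
  factor(paste(d$benefit, d$history, sep = "_")),
  ref = "control_control"
)
fit_group <- lm(X ~ group, data = d)
n_na_group <- sum(is.na(coef(fit_group)))

n_na_additive
n_na_group
```

With the single factor, each of the four coefficients compares one treatment combination with the control group, which matches the "one fewer dummy than groups" rule from the question.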

Solution 1
0, accepted, 2020-08-10 15:09:35