
How to avoid the dummy variable trap when dummy coding several treatments plus control group for linear regression in R

I understand that I need one fewer dummy variable than the total number of groups. However, I am stuck, because I keep receiving the message "1 not defined because of singularities" when running lm in R. I found a similar question here: "What is causing this error? Coefficients not defined because of singularities", but it is slightly different from my problem.

I have two treatments, (1) "benefit" and (2) "history", with two levels each: (1) "low" and "high", and (2) "short" and "long", i.e. 4 combinations. Additionally, I have a control group, which was exposed to neither. Therefore, I coded 4 dummy variables (one fewer than the total number of groups, n = 5). The dummy-coded data look like this:

                               low benefit  high benefit  short history  long history
Control group                           0             0              0             0
low benefit, short history              1             0              1             0
low benefit, long history               1             0              0             1
high benefit, short history             0             1              1             0
high benefit, long history              0             1              0             1

When I run my lm I get this:

Model: 
summary(lm(X ~ short history + high benefit + long history + low benefit + Control variables, data = df))

Coefficients: (1 not defined because of singularities)
                                         Estimate   Std. Error  t value  Pr(>|t|)
(Intercept)                           5.505398100  0.963932438  5.71139  4.8663e-08 ***
Dummy short history                   0.939025772  0.379091565  2.47704   0.0142196 *
Dummy high benefit                   -0.759944023  0.288192645 -2.63693   0.0091367 **
Dummy long history                    0.759352915  0.389085599  1.95163   0.0526152 .
Dummy low benefit                              NA           NA       NA          NA
Control variables                          xxx          xxx        xxx       xxx

This always happens for the dummy variable in the 4th position of the formula. The control variables are all estimated without problems.

I already tried including only two variables with two levels each: for "history" I coded 1 for "long" and 0 for "short", and for "benefit" 1 for "high" and 0 for "low". This way the lm worked, but the problem is that the control group and the combination "short history, low benefit" are coded identically, i.e. 0 and 0 for both variables.

I am sorry if this is a basic mistake, but I have not been able to figure it out. If you need more information, please say so. Thanks in advance.

As I said in the comments, you only have two variables. If you make them factors and check the contrasts, R will do the right thing. Please also see http://www.sthda.com/english/articles/40-regression-analysis/163-regression-with-categorical-variables-dummy-coding-essentials-in-r/
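For reference, the singularity reported in the question can be checked directly from the dummy table. This is only a sketch: the column and row names are hypothetical stand-ins for the question's variables, and it uses one row per group rather than real data.

```r
# Dummy coding from the question's table, one row per group.
# Column and row names are hypothetical stand-ins.
design <- data.frame(
  low_benefit   = c(0, 1, 1, 0, 0),
  high_benefit  = c(0, 0, 0, 1, 1),
  short_history = c(0, 1, 0, 1, 0),
  long_history  = c(0, 0, 1, 0, 1),
  row.names = c("control", "low_short", "low_long", "high_short", "high_long")
)

# Every row satisfies low + high == short + long, so the four dummies
# are linearly dependent and lm() has to drop one of them.
all(design$low_benefit + design$high_benefit ==
      design$short_history + design$long_history)  # TRUE

# Intercept plus four dummies gives 5 columns, but only rank 4.
mm <- model.matrix(~ ., data = design)
ncol(mm)     # 5
qr(mm)$rank  # 4
```

Because lm() drops whichever dependent column comes last in the formula, the NA lands on the 4th dummy no matter which one that is.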

Make up data representative of yours.

set.seed(2020)
df <- data.frame(
  X = runif(n = 120, min = 5, max = 15),
  benefit = rep(c("control", "low", "high"), 40),
  history = c(rep("control", 40), rep("long", 40), rep("short", 40))
)

Make benefit and history factors, and check that control is the base level of the contrasts for each.

df$benefit <- factor(df$benefit)
df$history <- factor(df$history)
contrasts(df$benefit)
#>         high low
#> control    0   0
#> high       1   0
#> low        0   1
contrasts(df$history)
#>         long short
#> control    0     0
#> long       1     0
#> short      0     1
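Note that factor() orders levels alphabetically, so "control" ends up as the base level above only because it sorts before the other labels. If your labels sort differently, the reference level can be set explicitly. A minimal sketch using a small stand-in vector (not the asker's data):

```r
# Small stand-in vector; labels match the simulated data above.
benefit <- c("low", "high", "control", "high", "low")

# Option 1: state the level order explicitly; the first level is the reference.
f1 <- factor(benefit, levels = c("control", "low", "high"))

# Option 2: keep the default (alphabetical) order and move the reference.
f2 <- relevel(factor(benefit), ref = "control")

levels(f1)     # "control" "low" "high"
levels(f2)     # "control" "high" "low"
contrasts(f1)  # the control row is all zeros
```

Either way, with the default treatment contrasts each coefficient is a difference from the first level.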

Run the regression and get the summary. The 4 coefficients are all compared to control/control.

lm(X ~ benefit + history, df)
#> 
#> Call:
#> lm(formula = X ~ benefit + history, data = df)
#> 
#> Coefficients:
#>  (Intercept)   benefithigh    benefitlow   historylong  historyshort  
#>      9.94474      -0.08721       0.11245       0.37021      -0.35675
summary(lm(X ~ benefit + history, df))
#> 
#> Call:
#> lm(formula = X ~ benefit + history, data = df)
#> 
#> Residuals:
#>     Min      1Q  Median      3Q     Max 
#> -5.4059 -2.3706 -0.0007  2.4986  4.7669 
#> 
#> Coefficients:
#>              Estimate Std. Error t value Pr(>|t|)    
#> (Intercept)   9.94474    0.56786  17.513   <2e-16 ***
#> benefithigh  -0.08721    0.62842  -0.139    0.890    
#> benefitlow    0.11245    0.62842   0.179    0.858    
#> historylong   0.37021    0.62842   0.589    0.557    
#> historyshort -0.35675    0.62842  -0.568    0.571    
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 2.809 on 115 degrees of freedom
#> Multiple R-squared:  0.01253,    Adjusted R-squared:  -0.02182 
#> F-statistic: 0.3648 on 4 and 115 DF,  p-value: 0.8333
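One thing to check against your own data: in the question's table the control group is the same set of rows for both variables, whereas the simulated data above cross "control" with the other levels. Assuming the real data follow the question's table, two three-level factors would repeat the same linear dependence, and a single five-level factor with control as the reference is one way around it. The following is a sketch under that assumption; df2, group and the group labels are hypothetical names.

```r
set.seed(2020)
# Hypothetical layout: 24 observations in each of the five groups.
groups <- c("control", "low_short", "low_long", "high_short", "high_long")
df2 <- data.frame(
  X     = runif(n = 120, min = 5, max = 15),
  group = factor(rep(groups, each = 24), levels = groups)
)

# One factor, control first: four coefficients, each the difference
# between one treatment cell and the control group.
fit <- lm(X ~ group, data = df2)
anyNA(coef(fit))   # FALSE: no coefficient is dropped
length(coef(fit))  # 5: intercept plus four group contrasts

# Splitting the same design into two three-level factors that share
# the control rows brings the dependence back.
g <- as.character(df2$group)
df2$benefit <- factor(sub("_.*", "", g))  # control / high / low
df2$history <- factor(sub(".*_", "", g))  # control / long / short
fit2 <- lm(X ~ benefit + history, data = df2)
sum(is.na(coef(fit2)))  # 1: one coefficient not defined
```

The single-factor model estimates each treatment cell against control, so it does not separate a "benefit" effect from a "history" effect; whether that matters depends on the comparison you need.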
