How to avoid the dummy variable trap when dummy coding several treatments plus control group for linear regression in R
I understand that I need one fewer dummy variable than the total number of groups. However, I am stuck, because I keep getting the message "1 not defined because of singularities" when running lm in R.
I found a similar question here: "What is causing this error? Coefficients not defined because of singularities", but it is slightly different from my problem.
I have two treatments, (1) "benefit" and (2) "history", with two levels each, (1) "low" and "high" and (2) "short" and "long", i.e. 4 combinations. Additionally, I have a control group, which was exposed to neither.
Therefore, I coded 4 dummy variables (one fewer than the total number of groups, n = 5). The dummy-coded data looks like this:
Group                        low benefit  high benefit  short history  long history
Control group                          0             0              0             0
low benefit, short history             1             0              1             0
low benefit, long history              1             0              0             1
high benefit, short history            0             1              1             0
high benefit, long history             0             1              0             1
When I run my lm I get this:
Model:
summary(lm(X ~ short history + high benefit + long history + low benefit + Control variables, data = df))
Coefficients: (1 not defined because of singularities)
                        Estimate  Std. Error  t value   Pr(>|t|)
(Intercept)          5.505398100 0.963932438  5.71139 4.8663e-08 ***
Dummy short history  0.939025772 0.379091565  2.47704  0.0142196 *
Dummy high benefit  -0.759944023 0.288192645 -2.63693  0.0091367 **
Dummy long history   0.759352915 0.389085599  1.95163  0.0526152 .
Dummy low benefit             NA          NA       NA         NA
Control variables            xxx         xxx      xxx        xxx
This always happens to the dummy variable in the 4th position. The control variables are all estimated without problems.
I already tried including only two variables with two levels each, meaning that for "history" I coded 1 for "long" and 0 for "short", and for "benefit" 1 for "high" and 0 for "low". This way the lm worked, but the problem is that the control group and the combination "short history, low benefit" are coded identically, i.e. 0 and 0 for both variables.
I am sorry if this is a basic mistake, but I have not been able to figure it out. If you need more information, please say so. Thanks in advance.
As I put in the comments, you only have two variables. If you make them factors and check the contrasts, R will do the right thing. Please also see http://www.sthda.com/english/articles/40-regression-analysis/163-regression-with-categorical-variables-dummy-coding-essentials-in-r/
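To see where the singularity in the question comes from, it can help to look at the dummy table itself. The sketch below rebuilds the five group rows from the question (the column names and the objects `dummies` and `design` are made up for illustration) and checks the rank of the design matrix once an intercept is added. The low and high benefit columns add up to the same vector as the short and long history columns, so one column is redundant, which is presumably why lm reports one coefficient as NA.

```r
# One row per group, in the order of the table in the question:
# control, low/short, low/long, high/short, high/long
dummies <- data.frame(
  low_benefit   = c(0, 1, 1, 0, 0),
  high_benefit  = c(0, 0, 0, 1, 1),
  short_history = c(0, 1, 0, 1, 0),
  long_history  = c(0, 0, 1, 0, 1)
)

# Design matrix as lm would see it: an intercept column plus the four dummies
design <- cbind(Intercept = 1, as.matrix(dummies))

# Five columns, but the rank is lower, so one coefficient cannot be estimated
ncol(design)
qr(design)$rank

# The linear dependency: both pairs of dummies sum to "not in the control group"
same_sum <- with(dummies,
                 all(low_benefit + high_benefit == short_history + long_history))
same_sum
```

Which of the four dummies ends up as NA depends only on the order of the terms in the formula, since lm drops the last column it finds to be redundant.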
Make up data representative of yours.
set.seed(2020)
df <- data.frame(
  X = runif(n = 120, min = 5, max = 15),
  benefit = rep(c("control", "low", "high"), 40),
  history = c(rep("control", 40), rep("long", 40), rep("short", 40))
)
Make benefit and history factors, and check that control is the base contrast for each.
df$benefit <- factor(df$benefit)
df$history <- factor(df$history)
contrasts(df$benefit)
#>         high low
#> control    0   0
#> high       1   0
#> low        0   1
contrasts(df$history)
#>         long short
#> control    0     0
#> long       1     0
#> short      0     1
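Here control ends up as the baseline only because factor() orders the levels alphabetically and "control" happens to sort first. If your labels sort differently, relevel() can set the reference level explicitly. A minimal sketch with a made-up vector (the `treatment` labels are hypothetical and not from the question):

```r
# Hypothetical labels in which the intended reference group does not sort first
treatment <- factor(c("placebo", "drugA", "drugB", "placebo", "drugA"))
levels(treatment)     # alphabetical order, so drugA would be the baseline

# Make "placebo" the reference level explicitly
treatment <- relevel(treatment, ref = "placebo")
levels(treatment)

# The reference level is the all-zero row of the contrast matrix
contrasts(treatment)
```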
Run the regression and get the summary. The 4 coefficients are all compared to control/control.
lm(X ~ benefit + history, df)
#>
#> Call:
#> lm(formula = X ~ benefit + history, data = df)
#>
#> Coefficients:
#>  (Intercept)   benefithigh    benefitlow   historylong  historyshort
#>      9.94474      -0.08721       0.11245       0.37021      -0.35675
summary(lm(X ~ benefit + history, df))
#>
#> Call:
#> lm(formula = X ~ benefit + history, data = df)
#>
#> Residuals:
#>     Min      1Q  Median      3Q     Max
#> -5.4059 -2.3706 -0.0007  2.4986  4.7669
#>
#> Coefficients:
#>              Estimate Std. Error t value Pr(>|t|)
#> (Intercept)   9.94474    0.56786  17.513   <2e-16 ***
#> benefithigh  -0.08721    0.62842  -0.139    0.890
#> benefitlow    0.11245    0.62842   0.179    0.858
#> historylong   0.37021    0.62842   0.589    0.557
#> historyshort -0.35675    0.62842  -0.568    0.571
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 2.809 on 115 degrees of freedom
#> Multiple R-squared: 0.01253, Adjusted R-squared: -0.02182
#> F-statistic: 0.3648 on 4 and 115 DF, p-value: 0.8333
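One thing to keep in mind: in the made-up data above, the control level of benefit also occurs together with long and short history. If, as in the question, the control group is the same set of rows in both factors, the benefit dummies and the history dummies again sum to the same column and lm will still report an NA. One possible way around that, sketched here as an assumption rather than as part of the answer above, is to code a single five-level group factor with control as the reference. The object `df2` and the group labels are invented for illustration.

```r
set.seed(2020)

# Hypothetical data: one factor with five groups, 20 observations each
groups <- c("control", "low_short", "low_long", "high_short", "high_long")
df2 <- data.frame(
  X     = runif(n = 100, min = 5, max = 15),
  group = rep(groups, each = 20)
)

# Make control the reference level explicitly
df2$group <- relevel(factor(df2$group), ref = "control")

# Four coefficients, each comparing one treatment combination to control
fit <- lm(X ~ group, data = df2)
coef(fit)
```

With this coding every coefficient is the difference between one treatment combination and the control group, so separate main effects of benefit and history are not estimated directly.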