How to avoid the dummy variable trap when dummy coding several treatments plus control group for linear regression in R

I understand that I need one fewer dummy variable than the total number of groups. However, I am stuck because I keep getting the message "1 not defined because of singularities" when running lm in R. I found a similar question here: What is causing this error? Coefficients not defined because of singularities, but it is slightly different from my problem.

I have two treatments, (1) "benefit" and (2) "history", with two levels each: (1) "low" and "high", and (2) "short" and "long", i.e. 4 combinations. Additionally, I have a control group, which was exposed to neither. Therefore, I coded 4 dummy variables (one fewer than the total number of groups, n = 5). The dummy-coded data looks like this:

                               low benefit  high benefit  short history  long history
Control group                           0             0              0             0
low benefit, short history              1             0              1             0
low benefit, long history               1             0              0             1
high benefit, short history             0             1              1             0
high benefit, long history              0             1              0             1

When I run my lm I get this:

Model: 
summary(lm(X ~ short history + high benefit + long history + low benefit + Control variables, data = df))

Coefficients: (1 not defined because of singularities)
                                         Estimate   Std. Error  t value  Pr(>|t|)
(Intercept)                           5.505398100  0.963932438  5.71139  4.8663e-08 ***
Dummy short history                   0.939025772  0.379091565  2.47704   0.0142196 *
Dummy high benefit                   -0.759944023  0.288192645 -2.63693   0.0091367 **
Dummy long history                    0.759352915  0.389085599  1.95163   0.0526152 .
Dummy low benefit                              NA           NA       NA          NA
Control variables                          xxx          xxx        xxx       xxx

This always happens for the dummy variable in the 4th position. The control variables are all estimated without problems.

I already tried including only two variables with two levels each: for "history" I coded 1 for "long" and 0 for "short", and for "benefit" 1 for "high" and 0 for "low". This way the lm worked, but the problem is that the control group and the combination "short history, low benefit" are coded identically, i.e. 0 and 0 for both variables.

I am sorry if this is a basic mistake, but I have not been able to figure it out. If you need more information, please say so. Thanks in advance.

As I put in the comments, you only have two variables. If you make them factors and check the contrasts, R will do the right thing. Please also see http://www.sthda.com/english/articles/40-regression-analysis/163-regression-with-categorical-variables-dummy-coding-essentials-in-r/

Make up data representative of yours.

set.seed(2020)
df <- data.frame(
  X = runif(n = 120, min = 5, max = 15),
  benefit = rep(c("control", "low", "high"), 40),
  history = c(rep("control", 40), rep("long", 40), rep("short", 40))
)

Make benefit and history factors, check that control is the base contrast for each.

df$benefit <- factor(df$benefit)
df$history <- factor(df$history)
contrasts(df$benefit)
#>         high low
#> control    0   0
#> high       1   0
#> low        0   1
contrasts(df$history)
#>         long short
#> control    0     0
#> long       1     0
#> short      0     1
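
In the made-up data, "control" happens to sort first alphabetically, so factor() makes it the base level on its own. If your labels sort differently, relevel() can set the reference explicitly. A small sketch, where the label "none" is only an illustrative stand-in for your control group:

```r
# Hypothetical labels: "none" sorts after "high" and "low"
benefit <- factor(c("none", "low", "high", "low", "none", "high"))
levels(benefit)                      # alphabetical order, so "none" is not the base

# Make "none" the reference level so coefficients are relative to it
benefit <- relevel(benefit, ref = "none")
base_level <- levels(benefit)[1]
contrasts(benefit)                   # the row for "none" is now all zeros
```

The same call works for history; only the ref argument changes.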

Run the regression and get the summary. There are four coefficients, all compared to control/control.

lm(X ~ benefit + history, df)
#> 
#> Call:
#> lm(formula = X ~ benefit + history, data = df)
#> 
#> Coefficients:
#>  (Intercept)   benefithigh    benefitlow   historylong  historyshort  
#>      9.94474      -0.08721       0.11245       0.37021      -0.35675
summary(lm(X ~ benefit + history, df))
#> 
#> Call:
#> lm(formula = X ~ benefit + history, data = df)
#> 
#> Residuals:
#>     Min      1Q  Median      3Q     Max 
#> -5.4059 -2.3706 -0.0007  2.4986  4.7669 
#> 
#> Coefficients:
#>              Estimate Std. Error t value Pr(>|t|)    
#> (Intercept)   9.94474    0.56786  17.513   <2e-16 ***
#> benefithigh  -0.08721    0.62842  -0.139    0.890    
#> benefitlow    0.11245    0.62842   0.179    0.858    
#> historylong   0.37021    0.62842   0.589    0.557    
#> historyshort -0.35675    0.62842  -0.568    0.571    
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 2.809 on 115 degrees of freedom
#> Multiple R-squared:  0.01253,    Adjusted R-squared:  -0.02182 
#> F-statistic: 0.3648 on 4 and 115 DF,  p-value: 0.8333
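
As for where the NA in the question comes from, it may help to look at the dummy table itself. The sketch below rebuilds that table with one row per group (the column names are my own) and checks its rank. Because the control group is 0 on every dummy, the two benefit dummies sum to the same column as the two history dummies, so lm has to drop one of them. Note that in the made-up data above the two "control" levels do not fall on the same rows, which is why no coefficient is dropped there. If your control rows coincide in both factors, a single five-level group factor (shown last, as an assumption about what you want to estimate) is one way to get four estimable treatment coefficients.

```r
# One row per group, following the coding table in the question
design <- cbind(
  intercept     = 1,
  low_benefit   = c(0, 1, 1, 0, 0),
  high_benefit  = c(0, 0, 0, 1, 1),
  short_history = c(0, 1, 0, 1, 0),
  long_history  = c(0, 0, 1, 0, 1)
)
rownames(design) <- c("control", "low/short", "low/long",
                      "high/short", "high/long")

# low + high equals short + long in every row: an exact linear dependency
dependency <- design[, "low_benefit"] + design[, "high_benefit"] -
  design[, "short_history"] - design[, "long_history"]
print(dependency)

# Five columns, but fewer independent ones, hence one NA coefficient
design_rank <- qr(design)$rank
print(design_rank)

# Hypothetical alternative: one factor with a level per group,
# with "control" as the reference level
group <- factor(rownames(design), levels = rownames(design))
group_matrix <- model.matrix(~ group)
group_rank <- qr(group_matrix)$rank
print(group_rank)
```

With the group factor, each coefficient is the difference between one treatment combination and the control group, which is a different parameterisation from separate benefit and history effects.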
