
Regression models with a categorical variable: dummy code or convert to factor?

I know this might be a slightly silly question, but I'm asking because multiple teachers, in multiple classes all using R, have taught me the same thing: DUMMY CODE! DUMMY CODE! DUMMY CODE!

So I did this comparison on the Auto data set in the ISLR package.

library(ISLR)

# Hand-built 0/1 indicators, one per cylinder count
Auto$c3 <- ifelse(Auto$cylinders == 3, 1, 0)  # not used below: 3 cylinders is the reference group
Auto$c4 <- ifelse(Auto$cylinders == 4, 1, 0)
Auto$c5 <- ifelse(Auto$cylinders == 5, 1, 0)
Auto$c6 <- ifelse(Auto$cylinders == 6, 1, 0)
Auto$c8 <- ifelse(Auto$cylinders == 8, 1, 0)

# The same variable as a factor
Auto$cylinders <- as.factor(Auto$cylinders)

# Fit the same model both ways
summary(lm(mpg ~ displacement + cylinders, data = Auto))
summary(lm(mpg ~ displacement + c4 + c5 + c6 + c8, data = Auto))

Call:
lm(formula = mpg ~ displacement + cylinders, data = Auto)

Residuals:
    Min      1Q  Median      3Q     Max 
-10.692  -2.694  -0.347   2.157  20.307 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)  24.33811    2.25278   10.80  < 2e-16 ***
displacement -0.05225    0.00693   -7.54  3.3e-13 ***
cylinders4   10.67609    2.23296    4.78  2.5e-06 ***
cylinders5   10.60478    3.39198    3.13   0.0019 ** 
cylinders6    7.04473    2.46493    2.86   0.0045 ** 
cylinders8    8.65170    2.92786    2.95   0.0033 ** 
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 4.39 on 386 degrees of freedom
Multiple R-squared:  0.687, Adjusted R-squared:  0.683 
F-statistic:  170 on 5 and 386 DF,  p-value: <2e-16

> summary(lm(mpg~displacement + c4 + c5 + c6 + c8, data = Auto))

Call:
lm(formula = mpg ~ displacement + c4 + c5 + c6 + c8, data = Auto)

Residuals:
    Min      1Q  Median      3Q     Max 
-10.692  -2.694  -0.347   2.157  20.307 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)  24.33811    2.25278   10.80  < 2e-16 ***
displacement -0.05225    0.00693   -7.54  3.3e-13 ***
c4           10.67609    2.23296    4.78  2.5e-06 ***
c5           10.60478    3.39198    3.13   0.0019 ** 
c6            7.04473    2.46493    2.86   0.0045 ** 
c8            8.65170    2.92786    2.95   0.0033 ** 
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 4.39 on 386 degrees of freedom
Multiple R-squared:  0.687, Adjusted R-squared:  0.683 
F-statistic:  170 on 5 and 386 DF,  p-value: <2e-16

Both produce the same output, which in my head is not surprising. What does surprise me is that I have been taught to dummy code instead of converting to a factor. Is there any analytical, computational, or other reason to dummy code rather than use a factor variable? Using a factor seems much easier, requires less code, and doesn't leave you with a bunch of extra variables. The only possible advantage of dummy coding that I can see is that you can choose your reference group, which I'm guessing you can probably do with a factor too.
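On the reference-group point, here is a minimal sketch of choosing the reference level of a factor with base R's relevel(). It uses the built-in mtcars data as a stand-in for ISLR::Auto so that it runs without extra packages; the column name cyl8ref is made up for illustration.

```r
# Built-in mtcars stands in for ISLR::Auto (an assumption, to avoid extra packages).
cars <- mtcars
cars$cyl <- factor(cars$cyl)   # levels "4", "6", "8"; the first level is the reference

fit_default <- lm(mpg ~ disp + cyl, data = cars)

# Make 8 cylinders the reference group instead (cyl8ref is a hypothetical name).
cars$cyl8ref <- relevel(cars$cyl, ref = "8")
fit_releveled <- lm(mpg ~ disp + cyl8ref, data = cars)

names(coef(fit_default))     # indicators for 6 and 8 cylinders
names(coef(fit_releveled))   # indicators for 4 and 6 cylinders

# Same model, different parameterization: the fitted values agree.
all.equal(fitted(fit_default), fitted(fit_releveled))
```

Only the meaning of the individual coefficients changes (each is now a contrast against 8 cylinders); the fit itself does not.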

Dummy coding can be done easily using the dummies package.

library(dummies)

#sample data
auto <- tail(ISLR::Auto,10)

#dummy coding
auto_dummyCoded <- cbind(auto, dummy(c("cylinders"), data=auto))
auto_dummyCoded

In the dummy coding above, two new variables are added (i.e. cylinders4 and cylinders6), because there are two cylinder categories in the sample data.
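If the dummies package is not available, roughly the same columns can be built with base R's model.matrix(). This is a sketch under assumptions: it uses ten rows of the built-in mtcars data instead of the ISLR::Auto sample, and the object names are made up for illustration.

```r
# Ten rows of built-in mtcars stand in for the ISLR::Auto sample (an assumption).
cars <- head(mtcars, 10)
cars$cyl <- factor(cars$cyl)

# "- 1" drops the intercept, so every level present gets its own 0/1 column.
dummies <- model.matrix(~ cyl - 1, data = cars)
colnames(dummies)

# Bind the indicator columns onto the original data, as in the example above.
cars_dummy_coded <- cbind(cars, dummies)
cars_dummy_coded[, c("mpg", "cyl", colnames(dummies))]
```

Each row has exactly one indicator set to 1, the one matching its cyl value.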


Now, instead of dummy coding, let's convert the cylinders column to a factor before passing it to lm.

auto$cylinders <- as.factor(auto$cylinders)
# x = TRUE keeps the design matrix in the fitted object as fit$x
fit <- lm(mpg ~ cylinders, data = auto, x = TRUE)

Let's print fit$x to see how the cylinders column was coded internally. R has encoded it as a single indicator column, cylinders6, plus a constant (Intercept) column. That is one indicator fewer than the number of categories in cylinders: the omitted category (4 cylinders) becomes the reference level and is absorbed into the intercept. It's just another way of dummy coding!

    (Intercept) cylinders6
388           1          0
389           1          1
390           1          0
391           1          0
392           1          0
393           1          0
394           1          0
395           1          0
396           1          0
397           1          0
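To check that the factor route and hand-built indicators really give lm the same numbers, the two design matrices can be compared directly. A minimal sketch, again using the built-in mtcars data as a stand-in for Auto; the names cyl_f, c6 and c8 are made up for illustration.

```r
# Built-in mtcars stands in for ISLR::Auto (an assumption).
cars <- mtcars
cars$cyl_f <- factor(cars$cyl)

# Hand-built indicators, with 4 cylinders left out as the reference group.
cars$c6 <- ifelse(cars$cyl == 6, 1, 0)
cars$c8 <- ifelse(cars$cyl == 8, 1, 0)

X_factor <- model.matrix(mpg ~ disp + cyl_f, data = cars)
X_dummy  <- model.matrix(mpg ~ disp + c6 + c8, data = cars)

# Only the column names differ; the numeric contents are the same.
all.equal(unname(X_factor), unname(X_dummy), check.attributes = FALSE)

# So the estimated coefficients match as well.
b_factor <- unname(coef(lm(mpg ~ disp + cyl_f, data = cars)))
b_dummy  <- unname(coef(lm(mpg ~ disp + c6 + c8, data = cars)))
all.equal(b_factor, b_dummy)
```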


 