
Multiple linear regression using lm() in R: output with categorical variable is incomplete?

I am working with this dataset https://archive.ics.uci.edu/ml/datasets/automobile . There is a categorical variable called 'num.of.doors' which can be (two, four) which is giving me trouble.

When I run lm(formula = price ~ horsepower + num.of.doors, data = train.sample) to predict prices, the output is:

Call:
lm(formula = price ~ horsepower + num.of.doors, data = train.sample)

Coefficients:
(Intercept)  horsepower  num.of.doors two
    -4006.5       174.1           -1856.2

But I would like to see the coefficient of num.of.doors for four doors. How do I do that?

If there are only two values for num.of.doors ("two" and "four") then the coefficient of "four" in this model is 0.

Your fitted equation is price = -4006.5 + 174.1 * horsepower - 1856.2 * (num.of.doors == "two")

So the price of a car with four doors is simply: price = -4006.5 + 174.1 * horsepower

This happens because the variable is categorical: R uses "four" as the baseline and reports the other level relative to it. The coefficient can be read as "if the car has two doors instead of four, the model estimates its price to be 1856.2 lower, at the same horsepower."
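The encoding described above can be inspected directly. The following is a minimal sketch on simulated data: the `toy` data frame and its numbers are made up for illustration and are not the UCI automobile data. It shows the single 0/1 column R creates for "two" and checks that the gap between two otherwise identical cars equals the reported coefficient.

```r
# Simulated stand-in for the automobile data (hypothetical values, not the UCI set)
set.seed(1)
toy <- data.frame(
  horsepower   = runif(40, 60, 200),
  num.of.doors = factor(rep(c("four", "two"), each = 20))
)
toy$price <- -4000 + 170 * toy$horsepower -
  1800 * (toy$num.of.doors == "two") + rnorm(40, sd = 500)

# Factor levels are sorted alphabetically, so "four" is the reference level
levels(toy$num.of.doors)

# The design matrix has one 0/1 column for "two"; "four" is the all-zero case
head(model.matrix(~ horsepower + num.of.doors, data = toy))

fit <- lm(price ~ horsepower + num.of.doors, data = toy)

# Two cars with the same horsepower that differ only in door count
new_cars <- data.frame(
  horsepower   = c(100, 100),
  num.of.doors = factor(c("four", "two"), levels = levels(toy$num.of.doors))
)
pred <- predict(fit, newdata = new_cars)

# The gap between the two predictions equals the "two" coefficient
pred[2] - pred[1]
coef(fit)["num.of.doorstwo"]
```

The last two lines print the same number, which is why no separate coefficient for "four" is needed to predict prices for four-door cars.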

I think the issue is that when you fit a linear regression with a categorical variable, one of the categories is used as the "reference" level, and its effect is absorbed into the intercept.

So, to see a coefficient for "four doors", change the reference level of the variable. You can do this with:

train.sample$num.of.doors <- relevel(factor(train.sample$num.of.doors), ref = "two")

Keep in mind that with this change "two doors" becomes the reference level, so it will no longer have a coefficient of its own.
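As a rough illustration of what releveling does to the output, here is a sketch on simulated data (the `toy` data frame and its values are hypothetical, not taken from the question). It fits the model once with the default reference level and once after making "two" the reference, then compares the coefficients.

```r
# Simulated data with the door count stored as plain text (hypothetical values)
set.seed(2)
toy <- data.frame(
  horsepower   = runif(40, 60, 200),
  num.of.doors = rep(c("four", "two"), times = 20),
  stringsAsFactors = FALSE
)
toy$price <- 150 * toy$horsepower +
  2000 * (toy$num.of.doors == "four") + rnorm(40, sd = 400)

# Default: lm() turns the text column into a factor with "four" as reference
fit_four_ref <- lm(price ~ horsepower + num.of.doors, data = toy)

# Make "two" the reference level by name rather than by position
toy$num.of.doors <- relevel(factor(toy$num.of.doors), ref = "two")
fit_two_ref <- lm(price ~ horsepower + num.of.doors, data = toy)

coef(fit_four_ref)  # reports num.of.doorstwo
coef(fit_two_ref)   # reports num.of.doorsfour
```

The door coefficient in the second fit is the first one with its sign flipped, and the horsepower slope is unchanged. Passing the level name to `ref` avoids depending on the order of the levels, which can shift if the column contains extra values such as a missing-data marker.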

There are a couple of options:

1 - Convert num.of.doors to a factor, and recode it to make two doors the base level. Once you run the lm command, it will show the coefficient for four doors in the linear regression. This can be achieved as follows:

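library(tidyverse)

# List "two" first so that it becomes the reference (base) level
new_train_sample <- train.sample %>%
  mutate(num.of.doors = factor(num.of.doors, levels = c("two", "four")))

# The summary now reports num.of.doorsfour, measured relative to two doors
lm_1 <- lm(formula = price ~ horsepower + num.of.doors, data = new_train_sample)
summary(lm_1)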

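2 - Drop the intercept by adding 0 + to the formula. Both door levels then get their own coefficient from the same lm command, but their interpretation changes: each one acts as the intercept for that level rather than as a difference from a baseline. Predictions and the horsepower coefficient are unaffected. Note that the R-squared reported by summary() is computed differently for a model without an intercept, so it should not be compared with the R-squared of the original model.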

lm_origin <- lm(formula = price ~ 0 + horsepower + num.of.doors, data = train.sample)
summary(lm_origin)
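To check how the no-intercept coefficients relate to the default ones, here is a small sketch on simulated data (the `toy` data frame and its values are hypothetical). It fits both versions and compares coefficients, fitted values, and R-squared.

```r
# Simulated stand-in for the automobile data (hypothetical values)
set.seed(3)
toy <- data.frame(
  horsepower   = runif(40, 60, 200),
  num.of.doors = factor(rep(c("four", "two"), times = 20))
)
toy$price <- -4000 + 170 * toy$horsepower -
  1800 * (toy$num.of.doors == "two") + rnorm(40, sd = 500)

fit_default <- lm(price ~ horsepower + num.of.doors, data = toy)
fit_no_int  <- lm(price ~ 0 + horsepower + num.of.doors, data = toy)

b  <- coef(fit_default)
b0 <- coef(fit_no_int)

# The "four" coefficient is the old intercept
c(b0["num.of.doorsfour"], b["(Intercept)"])

# The "two" coefficient is the old intercept plus the old "two" coefficient
c(b0["num.of.doorstwo"], b["(Intercept)"] + b["num.of.doorstwo"])

# Fitted values are the same in both parameterizations
all.equal(fitted(fit_default), fitted(fit_no_int))

# R-squared differs because it is computed differently without an intercept
c(summary(fit_default)$r.squared, summary(fit_no_int)$r.squared)
```

Both fits describe the same model; only the way the door effect is written down differs.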

