
Negative prediction values from linear regression in R

I fitted a linear regression in RStudio to predict the price of a car from its year of manufacture. The data set is called "audi", and the model looks like this:

library(tidyverse)
library(modelr)
...
model_price_Year <- lm(data = audi, price ~ year)
summary(model_price_Year)

The result of the summary is this:

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) -6.437e+06  8.503e+04  -75.71   <2e-16 
year         3.203e+03  4.215e+01   75.98   <2e-16 
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 9437 on 10666 degrees of freedom
Multiple R-squared:  0.3512,    Adjusted R-squared:  0.3511 
F-statistic:  5772 on 1 and 10666 DF,  p-value: < 2.2e-16

Then I built a grid and added predictions for 100 evenly spaced values of year. It looks like this:

grid_year <- audi %>%
  data_grid(year = seq_range(year, 100)) %>%
  add_predictions(model_price_Year, "price")

When I print the results, they look like this:

    year   price
   <dbl>   <dbl>
 1 1997  -41481.
 2 1997. -40737.
 3 1997. -39993.
 4 1998. -39249.
 5 1998. -38505.
 6 1998. -37761.
 7 1998. -37017.
 8 1999. -36273.
 9 1999. -35529.
10 1999. -34785.

They are all negative, and because we are talking about a price, that doesn't really make sense. Why are they negative? How do I interpret this?

Look at your data!

If you plot price against year, you will see that there is no reason to believe a straight line models that relationship. I say "straight line" rather than "linear" because a regression of log(price) on year is still a linear model: it is linear in its coefficients.

suppressPackageStartupMessages({
  library(tidyverse)
  library(modelr)
})

model_price_Year <- lm(price ~ year, data = audi)

grid_year <- audi %>%
  data_grid(
    year = seq_range(year, 100),
    .model = model_price_Year
  ) %>% 
  add_predictions(model_price_Year, "price")

plot(price ~ year, data = audi)
lines(price ~ year, data = grid_year, col = "red", lwd = 2)

Created on 2022-05-09 by the reprex package (v2.0.1)

The red line in the plot above takes negative values within the range of years in the data.
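
As a rough check of where that happens, the year at which the fitted line crosses zero can be computed by hand. This is only a sketch: it assumes the rounded coefficients printed in the question's summary() output, and the names zero_year and pred_1997 are made up for illustration.

```r
# Coefficients copied (rounded) from the summary() output in the question
intercept <- -6.437e6
slope     <-  3.203e3

# Year at which the fitted straight line crosses price = 0
zero_year <- -intercept / slope

# Prediction at the first grid year; it differs slightly from the printed
# -41481 because the coefficients above are rounded to four significant digits
pred_1997 <- intercept + slope * 1997

zero_year  # about 2009.7
pred_1997  # -40609
```

So, under these rounded values, every year before roughly 2010 gets a negative predicted price. The intercept itself is the predicted price at year 0, far outside the data, so it has no practical meaning on its own.
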
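The solution seems to be to regress log(price) ~ year. Predictions are then made on the log scale, and back-transforming them with exp() always gives a positive price.
After fitting this model, I plot the fitted line twice: once against log(price) and once back-transformed to the original price scale.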

model_price_Year_2 <- lm(log(price) ~ year, data = audi)

grid_year_2 <- audi %>%
  data_grid(
    year = seq_range(year, 100),
    .model = model_price_Year_2
  ) %>% 
  add_predictions(model_price_Year_2, "log_price")

plot(log(price) ~ year, data = audi)
lines(log_price ~ year, data = grid_year_2, col = "red", lwd = 2)

plot(price ~ year, data = audi)
lines(exp(log_price) ~ year, data = grid_year_2, col = "red", lwd = 2)

Created on 2022-05-09 by the reprex package (v2.0.1)
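
To see the difference between the two fits without the audi data, here is a minimal base R sketch on simulated data. Everything in it is an assumption made for illustration: the data frame cars, the 15% yearly growth rate, and the noise level are invented and are not taken from the real data set.

```r
# Simulated stand-in for the audi data (hypothetical values, not the real data)
set.seed(1)
n <- 1000
cars <- data.frame(year = sample(1997:2020, n, replace = TRUE))
# Prices grow by a constant percentage per year, with multiplicative noise
cars$price <- exp(9 + 0.15 * (cars$year - 2010) + rnorm(n, sd = 0.3))

# Straight-line fit on the price scale versus linear fit on the log scale
fit_line <- lm(price ~ year, data = cars)
fit_log  <- lm(log(price) ~ year, data = cars)

new_years <- data.frame(year = 1997:2020)
pred_line <- predict(fit_line, newdata = new_years)
pred_log  <- exp(predict(fit_log, newdata = new_years))  # back-transformed

min(pred_line)        # negative for the earliest years
all(pred_log > 0)     # exp() can never return a negative price

# On the log scale the slope reads as a relative change per year
growth_per_year <- exp(coef(fit_log)[["year"]]) - 1
growth_per_year
```

In this sketch the straight-line fit dips below zero for the oldest cars, while the back-transformed predictions stay positive by construction. Note that exp() of a prediction made on the log scale is usually read as an estimate of the median price rather than the mean.
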
