Why can't I predict one single value of MPG given one value of horsepower using R's predict() function?

Using ISLR's Auto dataset and the following code:

lm.fit <- lm(Auto$mpg ~ Auto$horsepower)
predict(lm.fit, newdata = data.frame(horsepower=100))
predict(lm.fit, data.frame(horsepower=(c(100))), interval="confidence")

I get a warning like this, and the result is not a single prediction:

Warning message: 'newdata' had 1 row but variables found have 392 rows

How can I fix this?

I have no idea why this fails:

lm.fit <- lm(Auto$mpg ~ Auto$horsepower)
predict(lm.fit, newdata = data.frame(horsepower=100))

but the standard way of doing this is to give the formula in terms of the data and include the data as an argument:

lm.fit <- lm(mpg ~ horsepower, data=Auto)
predict(lm.fit, newdata=data.frame(horsepower=100))

That should work. I don't have that dataset, so here it is on a tiny example:

> x=runif(100)
> y=runif(100)
> d = data.frame(x=x,y=y)
> m = lm(y~x, data=d)
> predict(m, newdata=data.frame(x=10))
       1 
0.454481 

But do it this way and bad things happen:

> m2 = lm(d$y~d$x)
> predict(m2, newdata=data.frame(x=10))
        1         2         3         4         5         6         7         8 
0.4699471 0.4686431 0.4687603 0.4691200 0

The underlying reason why you shouldn't use something like lm(data$y ~ data$whatever) is that this stores a hard-coded reference to the columns in your training dataset. Rather than using the Auto dataset, let's use the mtcars dataset which comes with R as an example.

Let's fit a model the wrong way:

m <- lm(mtcars$mpg ~ mtcars$wt)

After doing this, the model's terms component now refers specifically to mtcars$mpg and mtcars$wt rather than variables mpg and wt :

m$terms
# mtcars$mpg ~ mtcars$wt
# attr(,"variables")
# list(mtcars$mpg, mtcars$wt)
# attr(,"factors")
#            mtcars$wt
# mtcars$mpg         0
# mtcars$wt          1
# ...
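To check whether an existing model was fitted this way, you can look for `$` in the variables stored in its terms. This is only a sketch: `uses_dollar` is a hypothetical helper name, not part of base R, and it only catches the `data$column` pattern discussed here.

```r
# Hypothetical helper: TRUE if any variable stored in the model's terms
# was written with `$` (e.g. mtcars$wt) instead of a bare column name.
uses_dollar <- function(model) {
  # The "variables" attribute is a call like list(mtcars$mpg, mtcars$wt);
  # as.character() deparses each element, and [-1] drops the leading "list".
  vars <- as.character(attr(terms(model), "variables"))[-1]
  any(grepl("$", vars, fixed = TRUE))
}

uses_dollar(lm(mtcars$mpg ~ mtcars$wt))      # fitted the wrong way
uses_dollar(lm(mpg ~ wt, data = mtcars))     # fitted the standard way
```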

Now try to predict using this model:

predict(m, newdata=data.frame(wt=4))
#         1         2         3         4         5         6         7         8         9        10        11        12        13 
# 23.282611 21.919770 24.885952 20.102650 18.900144 18.793255 18.205363 20.236262 20.450041 18.900144 18.900144 15.533127 17.350247 
#        14        15        16        17        18        19        20        21        22        23        24        25        26 
# 17.083024  9.226650  8.296712  8.718926 25.527289 28.653805 27.478021 24.111004 18.472586 18.926866 16.762355 16.735633 26.943574 
#        27        28        29        30        31        32 
# 25.847957 29.198941 20.343151 22.480940 18.205363 22.427495 
# Warning message:
# 'newdata' had 1 row but variables found have 32 rows 

What happened? Rather than looking for a variable called wt , the predict method is looking for something called mtcars$wt . There is nothing of this sort in your newdata , so as a fallback it looks elsewhere (technically, it tries to evaluate the expression mtcars$wt first within newdata , and then in the environment where the model formula was created, which here is the global environment). This succeeds, and in fact resolves to the original column of data that we used to fit the model. Because of this, the newdata argument is essentially ignored, and you get one prediction per row of the training data.
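A quick way to convince yourself that newdata is being ignored is to compare predictions for two different values of wt and against the fitted values. This is a minimal sketch using mtcars; the names `m_bad`, `p4` and `p99` are just for illustration, and the warning is suppressed only to keep the output short.

```r
# Fit the model the wrong way, with `$` inside the formula.
m_bad <- lm(mtcars$mpg ~ mtcars$wt)

# Ask for predictions at two very different weights.
p4  <- suppressWarnings(predict(m_bad, newdata = data.frame(wt = 4)))
p99 <- suppressWarnings(predict(m_bad, newdata = data.frame(wt = 99)))

# One value per training row comes back, not one per row of newdata.
length(p4)

# The value supplied in newdata makes no difference at all.
identical(p4, p99)

# What comes back is just the fitted values on the training data.
all.equal(unname(p4), unname(fitted(m_bad)))
```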

Now fit the model the correct way:

m2 <- lm(mpg ~ wt, data=mtcars)

This stores the variable names mpg and wt in the model, so the name lookup works as intended:

predict(m2, newdata=data.frame(wt=4))
#        1 
# 15.90724 
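The same pattern covers the interval="confidence" call from the question. The sketch below assumes mtcars as a stand-in for Auto and its hp column as a stand-in for horsepower, since Auto needs the ISLR package; `fit_hp`, `new_cars` and `ci` are illustrative names.

```r
# Fit with bare column names and a data argument.
fit_hp <- lm(mpg ~ hp, data = mtcars)

# newdata must use the same column name as the formula (hp here).
new_cars <- data.frame(hp = c(100, 150))

# One row per row of newdata, with columns fit, lwr and upr.
ci <- predict(fit_hp, newdata = new_cars, interval = "confidence")
ci
```

With the Auto data, the equivalent would be lm(mpg ~ horsepower, data=Auto) followed by predict(..., newdata=data.frame(horsepower=100), interval="confidence").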
