Predicting with lm object in R - black box paradigm

Question

I have a function that returns an lm object. I want to produce predicted values based on some new data. The new data is a data.frame in exactly the same format as the data passed to the lm function, except that the response has been removed (since we're predicting, not training). I would expect to be able to execute the following, but I get an error:

predict( model , newdata )
"Error in eval(expr, envir, enclos) : object 'ModelResponse' not found"

In my case, ModelResponse was the name of the response column in the data I originally trained on. So just for kicks, I tried inserting an NA response:

newdata$ModelResponse = NA
predict( model , newdata )
Error in terms.default(object, data = data) : no terms component nor attribute

Highly frustrating: R's notion of models/regression doesn't match mine. 1. I train a model with some data and get a model object. 2. I can score new data from any environment/function/frame/etc., so long as the data I feed to the model object "looks like" the data I trained on (i.e. has the same column names). This is a standard black-box paradigm.

So here are my questions:
1. What concept(s) am I missing here?
2. How do I get my scenario to work?
3. How can I get the model object to be portable? str(model) shows me that the model object saved the original data it was trained on, so the model object is massive. I want my model to be portable to any function/environment/etc. and to contain only the data it needs to score.

Answer 1

In the absence of str() on either the model or the data offered to the model, here's my guess regarding this error message:

predict( model , newdata )
"Error in eval(expr, envir, enclos) : object 'ModelResponse' not found"

I guess that you made a model object named "model", that your outcome variable (the left-hand side of the formula in the original call to lm) was named "ModelResponse", and that you then named a column in newdata by the same name. What you should have done was leave out the "ModelResponse" column (because that is what you are predicting) and put in "Model_Predictor1", "Model_Predictor2", etc., i.e. all the names on the right-hand side of the formula given to lm().
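
As a minimal sketch of that workflow, assuming the model was fit with a formula plus a data= argument: the data here is simulated, and the column names simply mirror the ones guessed at above, so treat them as placeholders rather than your actual names.

```r
# Simulated training data; the column names are illustrative assumptions.
set.seed(42)
train <- data.frame(
  ModelResponse    = rnorm(20),
  Model_Predictor1 = runif(20),
  Model_Predictor2 = runif(20)
)

# Fit with a formula and data=, so that variables are looked up by column name.
model <- lm(ModelResponse ~ Model_Predictor1 + Model_Predictor2, data = train)

# New data holds only the right-hand-side columns; the response is left out.
newdata <- data.frame(
  Model_Predictor1 = c(0.2, 0.8),
  Model_Predictor2 = c(0.5, 0.1)
)

# One predicted value per row of newdata.
preds <- predict(model, newdata = newdata)
preds
```

If this pattern works for you but your own object does not, comparing class(model) and names(model) against this one may help narrow down the difference.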

The coef() function will allow you to extract the information needed to make the model portable.

mod.coef <- coef(model)
mod.coef
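
One way to use those coefficients, sketched under assumptions: the helper name make_scorer is hypothetical (it is not part of base R), the data is simulated, and the approach shown covers only plain numeric predictors. Factors would also need their levels carried along, and a rank-deficient fit with NA coefficients would need extra handling.

```r
# Simulated data; the column names are illustrative assumptions.
set.seed(1)
train <- data.frame(
  ModelResponse    = rnorm(30),
  Model_Predictor1 = runif(30),
  Model_Predictor2 = runif(30)
)
model <- lm(ModelResponse ~ Model_Predictor1 + Model_Predictor2, data = train)

# make_scorer() is a hypothetical helper: it keeps only the coefficients and
# the right-hand side of the formula, not the training data.
make_scorer <- function(fit) {
  beta <- coef(fit)                    # named coefficient vector
  rhs  <- delete.response(terms(fit))  # model terms with the response dropped
  function(newdata) {
    X <- model.matrix(rhs, data = newdata)  # rebuild the design matrix
    drop(X %*% beta)                        # linear predictor, i.e. yhat for lm
  }
}

score   <- make_scorer(model)
newdata <- data.frame(
  Model_Predictor1 = c(0.2, 0.8),
  Model_Predictor2 = c(0.5, 0.1)
)

manual  <- score(newdata)                     # coefficient-based prediction
builtin <- predict(model, newdata = newdata)  # reference from predict()
all.equal(unname(manual), unname(builtin))
```

The trade-off is that a scorer like this gives point predictions only; standard errors and intervals still need the full lm object.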

Since you expressed interest in Function from the rms/Hmisc package combination, here it is, using the help example from ols and comparing the output of the extracted function with that of the rms Predict method. Note the capitals: these functions are designed to work with the package equivalents of lm, glm(..., family="binomial"), and coxph, which in rms become ols, lrm, and cph.

> set.seed(1)
> x1 <- runif(200)
> x2 <- sample(0:3, 200, TRUE)
> distance <- (x1 + x2/3 + rnorm(200))^2
> d <- datadist(x1,x2)
> options(datadist="d")   # No d -> no summary, plot without giving all details
> 
> 
> f <- ols(sqrt(distance) ~ rcs(x1,4) + scored(x2), x=TRUE)
> 
> Function(f)
function(x1 = 0.50549065,x2 = 1) {0.50497361+1.0737604* x1- 
   0.79398383*pmax(x1-0.083887788,0)^3+ 1.4392827*pmax(x1-0.38792825,0)^3-  
   0.38627901*pmax(x1-0.65115162,0)^3-0.25901986*pmax(x1-0.92736774,0)^3+ 
   0.06374433*x2+ 0.60885222*(x2==2)+0.38971577*(x2==3) }
<environment: 0x11b4568e8>
> ols.fun <- Function(f)
> pred1 <- Predict(f, x1=1, x2=3)
> pred1
  x1 x2     yhat    lower    upper
1  1  3 1.862754 1.386107 2.339401

Response variable (y): sqrt(distance) 

Limits are 0.95 confidence limits
# The "yhat" is the same as one produces with the extracted function
> ols.fun(x1=1, x2=3)
[1] 1.862754

(I have learned through experience that the restricted cubic-spline fit functions coming from rms need to have spaces and carriage returns added to improve readability.)

Answer 2

Thinking long-term, you should probably take a look at the caret package. Most modeling functions accept both data frames and matrices, some prefer one or the other, and their expectations can vary in other ways too. It's important to get your head around each one quickly, but if you want a single wrapper that will simplify life for you by turning those intricacies into a "black box", then caret is as close as you can get.

As a disclaimer: I do not use caret, as I don't think modeling should be a black box. I've sent more than a few emails to maintainers of modeling packages after looking into their code and seeing something amiss. Wrapping that in another layer would not serve my interests. So, in the very long run, avoid caret and develop an enjoyment for dissecting what goes into and comes out of the different modeling functions. :)

2 answers

solution1
5 ACCEPTED 2011-08-17 18:08:11

solution2
2 2011-08-17 18:25:23
