
predict.lm with newdata

I've built an lm model without using the data= parameter:

m1 <- lm( mdldvlp.trim$y ~  gc.pc$scores[,1] + gc.pc$scores[,2] + gc.pc$scores[,3] + 
                            gc.pc$scores[,4] + gc.pc$scores[,5] + gc.pc$scores[,6] + predict(gc.tA))

Now I'd like to predict from m1 using newdata, so I need to name the columns of my new data.frame to match the variables used in the lm() call above.

With newComps as my new gc.pc scores (which, like the gc.tA prediction, were predicted using the new data.frame without any issues), I've tried

newD <- data.frame( newComps[1:100,1:6] ,
                    predict(gc.tA , newdata = mdldvlp[1:100,predKept]))


names(newD) <- names(m1$coefficients)[-1]
names(newD) <- names(m1$model)[-1]

names(newD) <- c( "gc.pc$scores[, 1]" , "gc.pc$scores[, 2]" , "gc.pc$scores[, 3]" , 
                  "gc.pc$scores[, 4]" , "gc.pc$scores[, 5]" , "gc.pc$scores[, 6]" , 
                  "predict(gc.tA)" )
names(newD) <- c( "gc.pc$scores[,1]" , "gc.pc$scores[,2]" , "gc.pc$scores[,3]" , 
                  "gc.pc$scores[,4]" , "gc.pc$scores[,5]" , "gc.pc$scores[,6]" , 
                  "predict(gc.tA)" )

Unfortunately, predict.lm does not accept the naming strategies above and returns the dreaded newdata warning along with the predictions from the original data.frame that built m1 :

Warning message:
'newdata' had 100 rows but variable(s) found have 1414 rows  

How should I name the newD columns to make the predict call work? Thanks.

The code below recreates the issue:

    require(rpart)

    set.seed(123)
    X <- matrix(runif(200) , 20 , 10)
    gc.pc <- princomp(X)
    y <- runif(20)
    mdldvlp.trim <- data.frame(y,X)
    names(mdldvlp.trim) <- c("y",paste("x",1:10,sep=""))
    predKept <- paste("x",1:10,sep="")

    gc.tA <- rpart( y ~ . , data = mdldvlp.trim)

    m1 <- lm( mdldvlp.trim$y ~  gc.pc$scores[,1] + gc.pc$scores[,2] + gc.pc$scores[,3] + 
                                gc.pc$scores[,4] + gc.pc$scores[,5] + gc.pc$scores[,6] + predict(gc.tA))

    mdldvlp <- data.frame(matrix(runif(2000) , 200 , 10))
    names(mdldvlp) <- predKept

    newComps <- predict( gc.pc , newdata=mdldvlp )

    newD <- data.frame( newComps[1:100,1:6] ,
                        predict(gc.tA , newdata = mdldvlp[1:100,predKept]))

    # enter newD naming strategy here

    predict( m1 , newdata=newD )

4/20 Follow up:

Thanks all for your answers. I understand that things would be easier if I first created a data.frame with properly named predictors. My question is: if the modeling data frame does indeed evaluate to a data frame with variables named gc.pc$scores[,1] etc., then why won't the naming 'strategies' used above work with predict.lm? In other words, does lm really build its modeling data frame with gc.pc$scores[,1] and so on? If it did, wouldn't the renaming strategies above work in predict.lm?

You are abusing the formula notation and it is this that is causing you problems. Essentially your formula:

m1 <- lm( mdldvlp.trim$y ~  gc.pc$scores[,1] + gc.pc$scores[,2] + 
                            gc.pc$scores[,3] + gc.pc$scores[,4] + 
                            gc.pc$scores[,5] + gc.pc$scores[,6] + 
                            predict(gc.tA))

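will produce a model frame whose columns are labelled gc.pc$scores[, 1] etc. (note the space that deparsing inserts after the comma). Those labels are not what predict() matches against, though. When you call predict(), each term of the formula is re-evaluated as an R expression, looking first in the object passed to the newdata argument and then in the environment where the formula was created. newdata contains no object called gc.pc, only columns whose names happen to look like the expression, so R falls back to the original gc.pc in your workspace. That is why renaming the columns of newD has no effect, and why you get the original predictions along with the warning.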
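A minimal sketch of this behaviour may help. It uses a made-up matrix called scores as a stand-in for gc.pc$scores (the data and the object names are assumptions for illustration, not part of the original example):

```r
set.seed(1)
scores <- matrix(rnorm(40), 20, 2)   # hypothetical stand-in for gc.pc$scores
y <- rnorm(20)

fit <- lm(y ~ scores[, 1] + scores[, 2])

# The model frame columns are labelled with the deparsed expressions ...
names(model.frame(fit))

# ... but the terms store them as calls to `[`, not as variable names
attr(terms(fit), "variables")

# Rename the new data to match the model frame labels exactly
newD <- data.frame(rnorm(5), rnorm(5))
names(newD) <- names(model.frame(fit))[-1]

# predict() still evaluates scores[, 1] against the original matrix,
# because newD contains no object called `scores`
p <- suppressWarnings(predict(fit, newdata = newD))
length(p)   # 20, not 5
```

The predictions returned here are the fitted values for the original 20 rows, which matches the symptom described in the question.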

Ideally, you'd create a data frame containing all the variables you want, with appropriate names, e.g.:

fitData <- data.frame(mdldvlp.trim$y, gc.pc$scores[, 1:6], predict(gc.tA))
names(fitData) <- c("trimY", paste("scores", 1:6, sep = ""), "preds")

and then fit the model via:

m1 <- lm(trimY ~ ., data = fitData)

New predictions can be made from the model by providing a data frame with the same names as used to fit the model. Hence using your newD :

newD <- data.frame(newComps[1:100,1:6] ,
                   predict(gc.tA , newdata = mdldvlp[1:100,predKept]))
names(newD) <- c(paste("scores", 1:6, sep = ""), "preds")

and then predict()

predict(m1 , newdata=newD)

Full example

require(rpart)

set.seed(123)
X <- matrix(runif(200) , 20 , 10)
gc.pc <- princomp(X)
y <- runif(20)
mdldvlp.trim <- data.frame(y,X)
names(mdldvlp.trim) <- c("y",paste("x",1:10,sep=""))
predKept <- paste("x",1:10,sep="")

gc.tA <- rpart( y ~ . , data = mdldvlp.trim)
fitData <- data.frame(mdldvlp.trim$y, gc.pc$scores[, 1:6], predict(gc.tA))
names(fitData) <- c("trimY", paste("scores", 1:6, sep = ""), "preds")
m1 <- lm(trimY ~ ., data = fitData)
mdldvlp <- data.frame(matrix(runif(2000) , 200 , 10))
names(mdldvlp) <- predKept

newComps <- predict( gc.pc , newdata=mdldvlp )
newD <- data.frame(newComps[1:100,1:6] ,
                   predict(gc.tA , newdata = mdldvlp[1:100,predKept]))
names(newD) <- c(paste("scores", 1:6, sep = ""), "preds")
predict(m1 , newdata=newD)

I've had a similar issue in the past. I think I resolved it by giving my variables names instead of referring to them by column number. E.g., don't use gc.pc$scores[,1]; instead convert the gc.pc$scores matrix to a data frame and name the columns ("PC1", "PC2", etc.). Then make sure that your newdata is also a data frame that uses these names.
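The suggestion above could look roughly like the following sketch. The data is simulated, and the names pc, train, Xnew and the choice of three components are assumptions made for illustration:

```r
set.seed(123)
X <- matrix(runif(200), 20, 10)
colnames(X) <- paste0("x", 1:10)
y <- runif(20)
pc <- princomp(X)

# Turn the score matrix into a data frame with plain column names
train <- as.data.frame(pc$scores[, 1:3])
names(train) <- paste0("PC", 1:3)
train$y <- y

fit <- lm(y ~ PC1 + PC2 + PC3, data = train)

# Project new observations onto the same components, then reuse the names
Xnew <- matrix(runif(50), 5, 10)
colnames(Xnew) <- colnames(X)
newdata <- as.data.frame(predict(pc, newdata = Xnew)[, 1:3])
names(newdata) <- paste0("PC", 1:3)

p <- predict(fit, newdata = newdata)
length(p)   # one prediction per row of newdata
```

Because the formula refers only to names that exist as columns in both data frames, predict() has nothing to look up in the workspace.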

I had a similar issue. If my data frame had three or more variables (one outcome and two or more prediction variables), I had no problems when referring to columns by their column number. But, when my data frame had only two variables (one outcome, one predictor), R gave me lots of errors, including 'newdata' had 1 row but variables found have xx rows

Following Marc in the box's suggestion, I wrote a special case for instances in which the data frame has only two variables, and assigned variable names. This fixed my issue.

To fix this warning, I rewrote:

lr <- lm(train[ , ncol(train)] ~ ., data = train[ , -ncol(train)])

as:

if(ncol(train) == 2) {
    colnames(train) <- c('var1','var2')
    colnames(test) <- c('var1','var2')
    lr <- lm(var2 ~ var1, data = train)
} else if (ncol(train) > 2) {
    lr <- lm(train[ , ncol(train)] ~ ., data = train[ , -ncol(train)])
}
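A possible alternative that avoids the special case is sketched below, assuming the response is the last column of train and that test uses the same predictor names; the data is made up for illustration. One thing worth knowing is that, with only two columns, train[ , -ncol(train)] leaves a single column, which `[` drops to a plain vector unless drop = FALSE is given.

```r
set.seed(42)
train <- data.frame(x = runif(30))
train$y <- 2 * train$x + rnorm(30, sd = 0.1)
test <- data.frame(x = runif(4))

# Single-column subsetting silently drops the data frame structure
is.data.frame(train[, -ncol(train)])                 # FALSE
is.data.frame(train[, -ncol(train), drop = FALSE])   # TRUE

# Build the formula from the response's column name instead of its position
resp <- names(train)[ncol(train)]
lr <- lm(reformulate(".", response = resp), data = train)

p <- predict(lr, newdata = test)
length(p)   # 4
```

Since the formula refers to the response by name and the predictors come from data =, the same call should work for any number of predictor columns.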
