
Assessing glm by seeing how well it describes a different dataset in R

I've created a logistic regression model using glm with ~10 predictors and a binary response variable. The model was fitted to a subset of my full dataset (~8,000 observations): I randomly selected 3,000 observations, put them in a new data frame (newdata), and fitted the glm to newdata.

In order to assess the model, I would like to see how well it describes the data in a different dataset (testdata), e.g. a random selection of ~1,000 observations from the full dataset. How would I go about doing this in R?

I have computed confidence intervals for the coefficients and looked at Wald statistics and the likelihood-ratio test to assess statistical significance, but I would also like to see how well the model describes a randomly chosen subset of the full dataset.

Thanks a bunch!

There are several possible approaches. First, to evaluate the model out of sample, you have to pick a performance metric. Say it's MSE, and suppose your test set is called test; then you would use:

mean((test$response - predict(m, newdata = test, type = "response"))^2)
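To make that one-liner concrete, here is a minimal self-contained sketch. The data are simulated and all names (full, newdata, test, m, x1, x2) are illustrative stand-ins for the question's setup, not taken from it:

```r
set.seed(1)

## Simulated stand-in for the full ~8,000-row dataset (hypothetical columns)
n <- 8000
full <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
full$response <- rbinom(n, 1, plogis(0.5 * full$x1 - 0.3 * full$x2))

## Random 3,000-row training set and 1,000-row test set, as in the question
train_idx <- sample(n, 3000)
newdata   <- full[train_idx, ]
test      <- full[sample(setdiff(seq_len(n), train_idx), 1000), ]

m <- glm(response ~ x1 + x2, family = binomial, data = newdata)

## Out-of-sample MSE; for a binary response and predicted probabilities
## this is the Brier score
mse <- mean((test$response - predict(m, newdata = test, type = "response"))^2)
mse
```

Note that type = "response" is what makes predict() return probabilities rather than log-odds, so the squared differences are on the 0-1 scale.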

For logistic regression you could calculate the deviance for the binomial family instead of using MSE. Or you could use the area under the ROC curve (AUC/Gini), which is available in the ROCR package. You might also want to do cross-validation rather than a single out-of-sample test, which can be done with cvTools::cvFit.
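The metrics above can be computed in a few lines of base R. This sketch uses simulated data with hypothetical names (d, train, test, m); the AUC is computed with the rank (Wilcoxon) formula, which gives the same number ROCR would report, and the cross-validation uses boot::cv.glm (the boot package ships with R) as one alternative to cvTools::cvFit:

```r
set.seed(1)
n <- 4000
d <- data.frame(x = rnorm(n))
d$y <- rbinom(n, 1, plogis(d$x))
train <- d[1:3000, ]
test  <- d[3001:4000, ]

m <- glm(y ~ x, family = binomial, data = train)
p <- predict(m, newdata = test, type = "response")

## Out-of-sample binomial deviance: -2 * log-likelihood of the test labels
## under the model's predicted probabilities
dev <- -2 * sum(log(ifelse(test$y == 1, p, 1 - p)))

## AUC without extra packages: the probability that a randomly chosen
## positive case gets a higher score than a randomly chosen negative case
r  <- rank(p)
n1 <- sum(test$y)
n0 <- sum(1 - test$y)
auc <- (sum(r[test$y == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)

## 10-fold cross-validated prediction error (default cost is squared error)
library(boot)
cv_err <- cv.glm(train, m, K = 10)$delta[1]

c(deviance = dev, auc = auc, cv_error = cv_err)
```

Lower deviance and cross-validation error are better; AUC runs from 0.5 (no discrimination) to 1 (perfect ranking).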
