R GLM function omitting data

I'm fitting a logistic regression model to predict a binary factor outcome (yes/no), but I'm running into a strange issue with missing data. I get a very different pseudo R-squared when I manually filter out incomplete observations before calling glm() than when I let glm() apply its own na.action. Sample code:

library(rcompanion)  # provides nagelkerke()

outcome <- rnorm(100)
outcome <- ifelse(outcome <= 0.5, 0, 1)  # dichotomize into a binary outcome
var1 <- rnorm(100)
var2 <- rnorm(100)
var3 <- c(rnorm(88), rep(NA, 12))  # the last 12 values are missing
df <- data.frame(outcome, var1, var2, var3)
df$outcome <- factor(df$outcome)

model_1 <- glm(outcome ~., data = df, family = "binomial")
nagelkerke(model_1)

Outcome of model_1:

$Pseudo.R.squared.for.model.vs.null
                             Pseudo.R.squared
McFadden                             0.160916
Cox and Snell (ML)                   0.192093
Nagelkerke (Cragg and Uhler)         0.261581

Now I tried filtering out the cases beforehand and receive a completely different R-squared:

df_clean <- dplyr::filter(df, !is.na(var3))  # keep only rows where var3 is observed

model_2 <- glm(outcome ~., data = df_clean, family = "binomial")
nagelkerke(model_2)

Outcome of model_2:

$Pseudo.R.squared.for.model.vs.null
                             Pseudo.R.squared
McFadden                            0.0110171
Cox and Snell (ML)                  0.0123142
Nagelkerke (Cragg and Uhler)        0.0182368

Why is this the case, given that GLM's default na.action = na.omit (which I interpret as omitting cases with missing values)? Isn't this essentially the same thing as filtering out these cases beforehand and then running the model?

Also, I tried changing the na.action to "na.omit" and "na.exclude" and receive the same outputs. Thanks for your help!

You are correct that na.omit drops the rows with missing values before the model is fitted. In fact, summary(model_1) and summary(model_2) should show identical coefficient estimates and deviances.
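A quick way to confirm this is to compare the two fits directly. The following is a minimal sketch in base R only; the seed is an arbitrary assumption added for repeatability, and the data are regenerated, so the numbers will not match those in the question.

```r
set.seed(1)  # arbitrary seed (an assumption), only so the sketch is repeatable

df <- data.frame(
  outcome = factor(ifelse(rnorm(100) <= 0.5, 0, 1)),
  var1    = rnorm(100),
  var2    = rnorm(100),
  var3    = c(rnorm(88), rep(NA, 12))  # last 12 values are missing
)
df_clean <- df[!is.na(df$var3), ]      # manual filtering, base R equivalent of filter()

model_1 <- glm(outcome ~ ., data = df,       family = "binomial")
model_2 <- glm(outcome ~ ., data = df_clean, family = "binomial")

# Both models are fitted on the same 88 complete rows
nobs(model_1)
nobs(model_2)

# The coefficient estimates and residual deviances agree
all.equal(coef(model_1), coef(model_2))
all.equal(deviance(model_1), deviance(model_2))
```

So the difference in the pseudo R-squared values cannot come from glm() itself.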

However, the nagelkerke function that you are using (from the rcompanion package) runs into trouble when the original dataset contains NA values in one variable. From the documentation:

The fitted model and the null model should be properly nested. That is, the terms of one need to be a subset of the other, and they should have the same set of observations. One issue arises when there are NA values in one variable but not another, and observations with NA are removed in the model fitting. The result may be fitted and null models with different sets of observations. Setting restrictNobs to TRUE ensures that only observations in the fit model are used in the null model. This appears to work for lm and some glm models, but causes the function to fail for other model object types.

If you set restrictNobs to TRUE, i.e. nagelkerke(model_1, restrictNobs = TRUE), you should see the same output as for model_2.
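To see why the mismatch inflates the result, here is a rough sketch that computes McFadden's pseudo R-squared by hand in base R. The helper mcfadden() and the objects null_all and null_same are hypothetical names introduced for illustration; this mimics the effect described in the documentation rather than reproducing nagelkerke()'s internals. The seed is again an arbitrary assumption.

```r
set.seed(1)  # arbitrary seed (an assumption), only so the sketch is repeatable

df <- data.frame(
  outcome = factor(ifelse(rnorm(100) <= 0.5, 0, 1)),
  var1    = rnorm(100),
  var2    = rnorm(100),
  var3    = c(rnorm(88), rep(NA, 12))
)

model_1 <- glm(outcome ~ ., data = df, family = "binomial")

# Null model on ALL 100 rows: the outcome has no NAs, so nothing is dropped
null_all <- glm(outcome ~ 1, data = df, family = "binomial")

# Null model restricted to the 88 rows the fitted model actually used
null_same <- glm(outcome ~ 1, data = model.frame(model_1), family = "binomial")

# Hypothetical helper: McFadden's pseudo R-squared, 1 - logLik(fit) / logLik(null)
mcfadden <- function(fit, null) {
  1 - as.numeric(logLik(fit)) / as.numeric(logLik(null))
}

mcfadden(model_1, null_all)   # mismatched observations: inflated
mcfadden(model_1, null_same)  # same observations: the comparable value

# With rcompanion installed, the packaged equivalent of the second line is:
if (requireNamespace("rcompanion", quietly = TRUE)) {
  print(rcompanion::nagelkerke(model_1, restrictNobs = TRUE))
}
```

The null model on 100 rows has a more negative log-likelihood than the one on 88 rows simply because it sums over more observations, which makes the fitted model look better than it is.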
