
R GLM function omitting data

I'm creating a logistic regression model to predict a binary outcome stored as a factor (yes/no), but I'm running into a strange issue with missing data. I get a very different pseudo R-squared when I manually filter out incomplete observations before calling glm() than when I let glm() apply its own na.action. Sample code:

library(rcompanion)  # provides nagelkerke()

# Binary outcome and three predictors; var3 has 12 missing values
outcome <- ifelse(rnorm(100) <= 0.5, 0, 1)
var1 <- rnorm(100)
var2 <- rnorm(100)
var3 <- c(rnorm(88), rep(NA, 12))
df <- data.frame(outcome, var1, var2, var3)
df$outcome <- factor(df$outcome)

# Fit on the full data frame; glm() drops incomplete rows via na.action
model_1 <- glm(outcome ~ ., data = df, family = "binomial")
nagelkerke(model_1)

Output for model_1:

$Pseudo.R.squared.for.model.vs.null
                             Pseudo.R.squared
McFadden                             0.160916
Cox and Snell (ML)                   0.192093
Nagelkerke (Cragg and Uhler)         0.261581

I then tried filtering out the incomplete cases beforehand and got a completely different pseudo R-squared:

library(dplyr)  # provides filter()

# Drop the rows where var3 is missing, then refit
df_clean <- filter(df, !is.na(var3))

model_2 <- glm(outcome ~ ., data = df_clean, family = "binomial")
nagelkerke(model_2)

Output for model_2:

$Pseudo.R.squared.for.model.vs.null
                             Pseudo.R.squared
McFadden                            0.0110171
Cox and Snell (ML)                  0.0123142
Nagelkerke (Cragg and Uhler)        0.0182368

Why is this the case, given that glm()'s default is na.action = na.omit (which I interpret as omitting cases with missing values)? Isn't this essentially the same as filtering out these cases beforehand and then fitting the model?

I also tried setting na.action to "na.omit" and "na.exclude" and got the same outputs. Thanks for your help!

You are correct that na.omit drops the rows with missing values before the model is fitted. In fact, summary(model_1) and summary(model_2) should report the same coefficients and deviances; the only difference is that summary(model_1) also notes that 12 observations were deleted due to missingness.
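A quick way to check this claim is to compare the two fits directly. The sketch below uses only base R; the seed and the base-R subsetting (in place of filter()) are choices made here for reproducibility, not part of the original question.

```r
# Reproducible toy data mirroring the question (the seed is an arbitrary choice).
set.seed(42)
df <- data.frame(
  outcome = factor(ifelse(rnorm(100) <= 0.5, 0, 1)),
  var1 = rnorm(100),
  var2 = rnorm(100),
  var3 = c(rnorm(88), rep(NA, 12))
)

# Fit once on the raw data (glm() drops incomplete rows itself) ...
model_1 <- glm(outcome ~ ., data = df, family = "binomial")

# ... and once on rows filtered by hand.
df_clean <- df[!is.na(df$var3), ]
model_2 <- glm(outcome ~ ., data = df_clean, family = "binomial")

# Both fits use the same 88 observations and give the same estimates.
nobs(model_1)
nobs(model_2)
all.equal(coef(model_1), coef(model_2))
all.equal(deviance(model_1), deviance(model_2))
```

If both all.equal() calls return TRUE, the discrepancy in the question cannot come from glm() itself.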

However, the nagelkerke function you are using runs into problems when a variable in the original dataset contains NA values. From its documentation:

The fitted model and the null model should be properly nested. That is, the terms of one need to be a subset of the other, and they should have the same set of observations. One issue arises when there are NA values in one variable but not another, and observations with NA are removed in the model fitting. The result may be fitted and null models with different sets of observations. Setting restrictNobs to TRUE ensures that only observations in the fit model are used in the null model. This appears to work for lm and some glm models, but causes the function to fail for other model object types.
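To see why mismatched observation sets matter, here is a minimal base-R sketch that computes McFadden's pseudo R-squared by hand, once against a null model fitted on all 100 rows and once against a null model restricted to the 88 complete rows. It is an illustration of the documentation's warning under the assumption that an intercept-only null model is refit on the original data; it is not rcompanion's actual implementation, and the seed and the helper name mcfadden are made up for this example.

```r
# Toy data mirroring the question (arbitrary seed for reproducibility).
set.seed(1)
df <- data.frame(
  outcome = factor(ifelse(rnorm(100) <= 0.5, 0, 1)),
  var1 = rnorm(100),
  var2 = rnorm(100),
  var3 = c(rnorm(88), rep(NA, 12))
)

# Full model: glm() silently drops the 12 rows where var3 is NA.
fit <- glm(outcome ~ ., data = df, family = "binomial")

# Intercept-only model on the raw data: the outcome has no NAs, so all 100 rows are used.
null_all <- glm(outcome ~ 1, data = df, family = "binomial")

# Intercept-only model restricted to the rows the full model actually used.
null_same <- glm(outcome ~ 1, data = df[complete.cases(df), ], family = "binomial")

# Hypothetical helper: McFadden's pseudo R-squared = 1 - logLik(model) / logLik(null).
mcfadden <- function(model, null) {
  1 - as.numeric(logLik(model)) / as.numeric(logLik(null))
}

c(fit = nobs(fit), null_all = nobs(null_all), null_same = nobs(null_same))
mcfadden(fit, null_all)   # null model sees 12 extra observations: inflated value
mcfadden(fit, null_same)  # same observations in both models: comparable value
```

The 100-row null model has a more negative log-likelihood simply because it covers more observations, so the ratio shrinks and the pseudo R-squared is inflated. This matches the pattern in the question, where model_1 reports the larger values.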

If you set restrictNobs to TRUE, you should see the same output for both models.
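A sketch of that call, assuming the rcompanion package is installed (the requireNamespace() guard, the seed, and the variable names res_restricted and res_filtered are additions for this example):

```r
# Toy data mirroring the question (arbitrary seed for reproducibility).
set.seed(1)
df <- data.frame(
  outcome = factor(ifelse(rnorm(100) <= 0.5, 0, 1)),
  var1 = rnorm(100),
  var2 = rnorm(100),
  var3 = c(rnorm(88), rep(NA, 12))
)
df_clean <- df[!is.na(df$var3), ]

model_1 <- glm(outcome ~ ., data = df, family = "binomial")
model_2 <- glm(outcome ~ ., data = df_clean, family = "binomial")

# Only run the comparison when rcompanion is available.
if (requireNamespace("rcompanion", quietly = TRUE)) {
  # restrictNobs = TRUE limits the null model to the observations used in the fit.
  res_restricted <- rcompanion::nagelkerke(model_1, restrictNobs = TRUE)
  res_filtered <- rcompanion::nagelkerke(model_2)

  print(res_restricted$Pseudo.R.squared.for.model.vs.null)
  print(res_filtered$Pseudo.R.squared.for.model.vs.null)
}
```

With the null model restricted this way, the two printed tables should agree, since both comparisons then rest on the same 88 observations.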
