R: how to plot ROC for logistic regression model with missing values
I have a logistic regression model and I'd like to plot its ROC curve. All variables have some missing data. Here's the summary:
X<-cbind(outcome, var1, var2)
summary(X)
# outcome var1 var2
# Min. :0.0000 Min. : 0.100 Min. : 65.1
# 1st Qu.:0.0000 1st Qu.: 0.600 1st Qu.: 91.9
# Median :0.0000 Median : 1.000 Median :101.0
# Mean :0.2643 Mean : 2.421 Mean :110.3
# 3rd Qu.:1.0000 3rd Qu.: 2.200 3rd Qu.:114.5
# Max. :1.0000 Max. :34.800 Max. :388.4
# NA's :165 NA's :80 NA's :30
The model seems to work:
model <- glm(outcome~var1+var2,family=binomial)
summary(model)
# Call:
# glm(formula = outcome ~ var1 + var2, family = binomial)
#
# Deviance Residuals:
# Min 1Q Median 3Q Max
# -1.63470 -0.67079 -0.56255 0.01727 2.07577
#
# Coefficients:
# Estimate Std. Error z value Pr(>|z|)
# (Intercept) -3.652208 0.973013 -3.754 0.000174 ***
# var1 0.386811 0.147054 2.630 0.008528 **
# var2 0.016165 0.008075 2.002 0.045316 *
# ---
# Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#
# (Dispersion parameter for binomial family taken to be 1)
#
# Null deviance: 135.91 on 117 degrees of freedom
# Residual deviance: 108.84 on 115 degrees of freedom
# (187 observations deleted due to missingness)
# AIC: 114.84
#
# Number of Fisher Scoring iterations: 6
But when I try to calculate the ROC curve, there is an error:
library(pROC)
roc(model)
# Error in roc.default(model) : No valid data provided.
I thought it could be due to the missing data, so I tried adding the na.action = na.exclude option, but the problem persists:
model2 <- glm(outcome~var1+var2,family=binomial, na.action = na.exclude)
roc(model2)
# Error in roc.default(model2) : No valid data provided.
I also tried lrm instead of glm, but it still doesn't work:
model.lrm<-lrm(outcome~var1+var2, options(na.action="na.delete"), x=TRUE, y=TRUE)
model.lrm
# Frequencies of Missing Values Due to Each Variable
# outcome var1 var2
# 165 80 30
#
# Logistic Regression Model
#
# lrm(formula = outcome ~ var1 + var2, data = options(na.action = "na.delete"),
# x = TRUE, y = TRUE)
#
#
# Model Likelihood Discrimination Rank Discrim.
# Ratio Test Indexes Indexes
# Obs 118 LR chi2 27.07 R2 0.300 C 0.782
# 0 87 d.f. 2 g 1.377 Dxy 0.565
# 1 31 Pr(> chi2) <0.0001 gr 3.964 gamma 0.565
# max |deriv| 7e-05 gp 0.189 tau-a 0.221
# Brier 0.150
# Coef S.E. Wald Z Pr(>|Z|)
# Intercept -3.6522 0.9730 -3.75 0.0002
# var1 0.3868 0.1471 2.63 0.0085
# var2 0.0162 0.0081 2.00 0.0453
#
roc(model.lrm)
# Error in roc.default(model.lrm) : No valid data provided.
Here are the first 20 observations:
> dput(head (dati[, c(2,3,4)], 20))
structure(list(outcome = c(NA, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0,
0, 0, 1, 0, 1, 1, 0, NA, 0), var1 = c(NA, NA, NA, NA, NA, NA,
NA, NA, NA, NA, 0.3, 0.5, 1.5, 4.5, 2, 2.2, 0.7, NA, NA, 0.3),
var2 = c(117, 84, NA, 90, 91, 113, 88, NA, 108, 178, 100,
86, 86, 95, 92, 111, 103, 81, NA, 95)), row.names = c(NA,
-20L), class = c("tbl_df", "tbl", "data.frame"))
What is the problem?
A ROC curve isn't built on a model, but on predictions derived from the model. Therefore you need to use the predict function to obtain predictions on the data. It looks like this:
predictions <- predict(model)
And then you can call the roc function with those:
roc(outcome, predictions)
The missing values will be ignored automatically.
The approach is very similar if you are using a test set:
test_predictions <- predict(model, newdata = test_data)
roc(test_data$outcome, test_predictions)
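One detail worth checking with missing data: roc expects the response and the predictor to have the same length. As far as I know, predict on a glm fitted with the default na.action (na.omit) returns one value per complete case only, while na.exclude (as in model2 from the question) pads the dropped rows with NA so the predictions line up with the original outcome vector. Here is a minimal sketch on simulated data; the data, the coefficients, and the names fit and roc_obj are made up for illustration and are not the asker's.

```r
library(pROC)

# Simulated stand-in for the data in the question (all values are made up)
set.seed(1)
n <- 300
var1 <- rexp(n)
var2 <- rnorm(n, mean = 110, sd = 30)
outcome <- rbinom(n, 1, plogis(-3.5 + 0.4 * var1 + 0.016 * var2))

# Knock out values at random to mimic the missingness
var1[sample(n, 80)] <- NA
var2[sample(n, 30)] <- NA
outcome[sample(n, 100)] <- NA

# na.exclude pads predictions with NA for the dropped rows,
# so they stay aligned with the original outcome vector
fit <- glm(outcome ~ var1 + var2, family = binomial, na.action = na.exclude)
predictions <- predict(fit)
length(predictions) == length(outcome)

# roc() drops the pairs where either value is missing
roc_obj <- roc(outcome, predictions)
plot(roc_obj)
auc(roc_obj)
```

If the lengths differ, refitting with na.action = na.exclude (or subsetting the outcome to the complete cases) should bring the two vectors back into alignment.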
I found a solution by modifying the code as follows:
roc(outcome, as.vector(fitted.values(model)),plot=TRUE)
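Note that fitted values are on the probability scale, whereas predict on a glm returns the linear predictor (log-odds) by default. Since a ROC curve depends only on the ranking of the predictions, both should give the same curve. A small sketch on made-up data (x, y, and fit are hypothetical names, not from the question) to check this:

```r
library(pROC)

# Made-up data for illustration only
set.seed(2)
x <- rnorm(200)
y <- rbinom(200, 1, plogis(x))
fit <- glm(y ~ x, family = binomial)

# Same model, predictions on two different scales
roc_link <- roc(y, predict(fit), quiet = TRUE)                     # log-odds
roc_prob <- roc(y, predict(fit, type = "response"), quiet = TRUE)  # probabilities

# The two AUC values should agree
all.equal(as.numeric(auc(roc_link)), as.numeric(auc(roc_prob)))
```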