R: how to plot ROC for a logistic regression model with missing values
I have a logistic regression model and I want to plot its ROC curve. All of the variables have some missing data. Here is the summary:
X<-cbind(outcome, var1, var2)
summary(X)
#     outcome            var1             var2
#  Min.   :0.0000   Min.   : 0.100   Min.   : 65.1
#  1st Qu.:0.0000   1st Qu.: 0.600   1st Qu.: 91.9
#  Median :0.0000   Median : 1.000   Median :101.0
#  Mean   :0.2643   Mean   : 2.421   Mean   :110.3
#  3rd Qu.:1.0000   3rd Qu.: 2.200   3rd Qu.:114.5
#  Max.   :1.0000   Max.   :34.800   Max.   :388.4
#  NA's   :165      NA's   :80       NA's   :30
The model seems to work:
model <- glm(outcome~var1+var2,family=binomial)
summary(model)
# Call:
# glm(formula = outcome ~ var1 + var2, family = binomial)
#
# Deviance Residuals:
# Min 1Q Median 3Q Max
# -1.63470 -0.67079 -0.56255 0.01727 2.07577
#
# Coefficients:
# Estimate Std. Error z value Pr(>|z|)
# (Intercept) -3.652208 0.973013 -3.754 0.000174 ***
# var1 0.386811 0.147054 2.630 0.008528 **
# var2 0.016165 0.008075 2.002 0.045316 *
# ---
# Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#
# (Dispersion parameter for binomial family taken to be 1)
#
# Null deviance: 135.91 on 117 degrees of freedom
# Residual deviance: 108.84 on 115 degrees of freedom
# (187 observations deleted due to missingness)
# AIC: 114.84
#
# Number of Fisher Scoring iterations: 6
But when I try to compute the ROC curve, I get an error:
library(pROC)
roc(model)
# Error in roc.default(model) : No valid data provided.
I thought this might be caused by the missing data, so I tried adding the na.action = na.exclude option, but the problem persists:
model2 <- glm(outcome~var1+var2,family=binomial, na.action = na.exclude)
roc(model2)
# Error in roc.default(model2) : No valid data provided.
I also tried using lrm instead of glm, but it still does not work:
model.lrm<-lrm(outcome~var1+var2, options(na.action="na.delete"), x=TRUE, y=TRUE)
model.lrm
# Frequencies of Missing Values Due to Each Variable
# outcome var1 var2
# 165 80 30
#
# Logistic Regression Model
#
# lrm(formula = outcome ~ var1 + var2, data = options(na.action = "na.delete"),
# x = TRUE, y = TRUE)
#
#
# Model Likelihood Discrimination Rank Discrim.
# Ratio Test Indexes Indexes
# Obs 118 LR chi2 27.07 R2 0.300 C 0.782
# 0 87 d.f. 2 g 1.377 Dxy 0.565
# 1 31 Pr(> chi2) <0.0001 gr 3.964 gamma 0.565
# max |deriv| 7e-05 gp 0.189 tau-a 0.221
# Brier 0.150
# Coef S.E. Wald Z Pr(>|Z|)
# Intercept -3.6522 0.9730 -3.75 0.0002
# var1 0.3868 0.1471 2.63 0.0085
# var2 0.0162 0.0081 2.00 0.0453
#
roc(model.lrm)
# Error in roc.default(model.lrm) : No valid data provided.
Here are the first 20 observations:
dput(head(dati[, c(2, 3, 4)], 20))
structure(list(outcome = c(NA, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0,
0, 0, 1, 0, 1, 1, 0, NA, 0), var1 = c(NA, NA, NA, NA, NA, NA,
NA, NA, NA, NA, 0.3, 0.5, 1.5, 4.5, 2, 2.2, 0.7, NA, NA, 0.3),
var2 = c(117, 84, NA, 90, 91, 113, 88, NA, 108, 178, 100,
86, 86, 95, 92, 111, 103, 81, NA, 95)), row.names = c(NA,
-20L), class = c("tbl_df", "tbl", "data.frame"))
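What is the problem?
A ROC curve is not built from the model itself but from the predictions derived from the model. You therefore need to use the predict function to obtain predictions for your data. It looks like this: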
predictions <- predict(model)
You can then call the roc function with:
roc(outcome, predictions)
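Missing values are ignored automatically: roc drops any observation whose outcome or prediction is NA. The two vectors must still have the same length, so predict has to return one value per row of the data. It does so when the model is fitted with na.action = na.exclude (as in model2 above); with the default na.omit it returns values for the complete cases only.
If you are using a test set, this stays simple and looks very similar: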
test_predictions <- predict(model, newdata = test_data)
roc(test_data$outcome, test_predictions)
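As a rough end-to-end sketch of the approach above, the following uses simulated data rather than the asker's dataset. The data frame d, its values, and the names fit_omit and fit_excl are made up for illustration, and pROC is assumed to be installed. It shows why the choice of na.action matters for lining up predictions with the outcome before calling roc:

```r
library(pROC)  # assumed to be installed

set.seed(42)
n <- 300
# Simulated stand-in for the asker's data (hypothetical values)
d <- data.frame(var1 = rexp(n), var2 = rnorm(n, mean = 110, sd = 30))
d$outcome <- rbinom(n, 1, plogis(-3.5 + 0.4 * d$var1 + 0.02 * d$var2))
# Knock out some values at random in every column
d$outcome[sample(n, 60)] <- NA
d$var1[sample(n, 80)] <- NA
d$var2[sample(n, 30)] <- NA

# Default na.action (na.omit): predictions only for complete cases
fit_omit <- glm(outcome ~ var1 + var2, family = binomial, data = d)
# na.exclude: predictions padded with NA, one per row of d
fit_excl <- glm(outcome ~ var1 + var2, family = binomial, data = d,
                na.action = na.exclude)

length(predict(fit_omit))  # number of complete cases
length(predict(fit_excl))  # nrow(d)

# Predictions now align with d$outcome; roc() drops the NA pairs itself
pred <- predict(fit_excl, type = "response")
roc_obj <- roc(d$outcome, pred)
auc(roc_obj)
plot(roc_obj)
```

Passing the data through the data argument also keeps the outcome and the predictors in a single object, which makes it harder to pair predictions with the wrong rows. Note that this is an in-sample ROC curve; a held-out test set, as in the snippet above, gives a less optimistic estimate.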
I found a solution by modifying the code as follows:
roc(outcome, as.vector(fitted.values(model)),plot=TRUE)
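One difference between the two answers is the scale of the scores: fitted.values returns probabilities, while predict on a glm returns the linear predictor (log-odds) by default. Since a ROC curve depends only on how the scores rank the observations, both should give the same curve. The sketch below checks this on simulated data using base R only; rank_auc is a hypothetical helper written for this illustration (the Mann-Whitney rank formula for the AUC), not a pROC function.

```r
set.seed(1)
n <- 200
x <- rnorm(n)                       # simulated predictor (hypothetical)
y <- rbinom(n, 1, plogis(-1 + x))   # simulated binary outcome
fit <- glm(y ~ x, family = binomial)

link_scale <- predict(fit)            # log-odds (default type = "link")
prob_scale <- as.vector(fitted(fit))  # probabilities

# Hypothetical helper: AUC from the Mann-Whitney rank-sum formula.
# It uses only the ranks of the scores, not their actual values.
rank_auc <- function(response, score) {
  r <- rank(score)
  n1 <- sum(response == 1)
  n0 <- sum(response == 0)
  (sum(r[response == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

auc_link <- rank_auc(y, link_scale)
auc_prob <- rank_auc(y, prob_scale)
all.equal(auc_link, auc_prob)
```

The probabilities are a strictly increasing transform of the log-odds, so the ranks, and with them the AUC, are unchanged. Either scale can therefore be passed to roc.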