
Possible reasons for AUC=1 (from a fitted glm model)?

I am working with high-throughput microarray data (a methylation array), and after running univariate, lasso and cross-validated lasso analyses, I was able to narrow the candidates down to a list of 15 probes (predictors).

Now, I want to build a ROC curve and compute the AUC in order to check whether those predictors are in fact good candidates. The problem is that the result is a ROC curve with AUC = 1. I have been trying to tweak the fitted model (i.e. family and maxit), but the results did not change.

Here is a sample of the data (with 8 predictors) and the downstream analysis, with some explanation:

          Tumor   probe_1     probe_2     probe_3    probe_4    probe_5    probe_6    probe_7     probe_8
Benign.A4    No -5.076257 -3.18658187 -2.91627872 -3.2393655 -2.4080861 -3.9414602 -4.5844204 -2.96877633
Benign.A1    No -3.232952 -2.21518181  0.71340947 -2.1103999 -1.4563154 -4.0614544 -2.9378821 -0.90468942
Benign.C2    No -4.487701 -3.34515435 -5.35341349 -2.0355878 -2.9573763 -4.2980546 -4.3421487 -2.35597830
Benign.C8    No -3.692610 -1.24332686 -0.59115736 -3.4852858 -2.3339160 -3.1302782 -3.0943430 -1.03581249
Benign.D7    No -2.978757 -0.05097524  0.02744634 -1.4946543 -1.5593915 -2.8860660 -2.7633458 -0.99299595
Benign.D3    No -2.441925 -1.98227873 -2.13478645 -3.0265593 -2.7789079 -3.9860489 -2.8512663 -2.61804934
Tumor.A6    Yes  1.044348 -5.85637090 -4.49697162  1.5033139  0.3226736  1.5937440 -0.4881769  0.95135529
Tumor.A5    Yes  1.749187 -2.93393903 -5.54439148  2.4403760  1.6238294 -1.1699169  3.0410728  1.07437064
Tumor.A2    Yes  2.323806 -6.57693143 -5.78690184  1.7684931  2.3522317  0.3517146 -1.9972320  1.46663990
Tumor.C1    Yes  2.229316 -6.69010615 -6.22036584  0.7482678  1.3277280  0.6128029  1.3349142  1.63602050
Tumor.C6    Yes  2.888489 -5.79079519 -5.02991621  1.4605461  1.3002248  1.1498193  0.4481215  0.81473797
Tumor.C5    Yes -1.861726 -5.14400193 -5.26197761  1.0023323  0.8582683  0.5492184  0.6720438  1.73785369
Tumor.D1    Yes  2.776804 -6.78537165 -6.20280759  2.0623420  1.8291220  1.7328508  1.3667038  1.77813837
Tumor.D6    Yes  2.985209 -6.13405436 -5.92181030  1.8801728  1.1815045  2.2210693  0.1363381  2.21102559
Tumor.D8    Yes  1.670136 -6.72855542 -6.61156537  1.9847271  1.6267041 -2.8621148  0.7134887 -0.56794735
Tumor.A3    Yes  2.106628 -5.61286600 -5.75976883  2.1291475  0.5839721  1.4210874  1.2746626  1.77239233
Tumor.A8    Yes  1.798005 -5.53405698 -5.34042037  3.0262657  1.2199790  1.2448107  1.2297283  0.25649834
Tumor.A7    Yes  1.798074 -6.03775348 -5.01964376  1.2428083  2.3899569  0.6292222  0.6439477  0.92047002
Tumor.C3    Yes  1.542737 -6.54219832 -5.94287577  1.6111676  2.1889028  0.1228641  0.7950770  1.38000135
Tumor.C7    Yes  3.369420 -6.84809093 -5.88474727  2.7525838  3.2090893  1.1435739  1.2199450  0.89089956
Tumor.C4    Yes  3.179484 -6.59432541 -5.68920298  2.4093288  2.3173752 -0.3378846  1.3653768  0.66432101
Tumor.D5    Yes  2.328382 -6.41234621 -6.18003184 -0.1768171  2.1202506  2.4287615  1.7804487  0.08098025
Tumor.D4    Yes  3.051829 -7.01875245 -6.32614849  1.4200916  2.3582254  2.4981644  1.7878118  1.14826500
Tumor.D2    Yes  2.686846 -3.57625801 -6.25573666  1.6330575  0.8448418  1.4229245 -0.6461006  0.09491185

The glm analysis:

> glmcgs <- glm(Tumor ~ probe_1 + probe_2 + probe_3 + probe_4 + probe_5 + probe_6 + probe_7 + probe_8 + probe_9 + 
                  probe_10 + probe_11 + probe_12 + probe_13 + probe_14 + probe_15, data=cgshort, family = quasibinomial(link = 'logit'), maxit=100)

> summary(glmcgs)

Call:
glm(formula = Tumor ~ probe_1 + probe_2 + probe_3 + probe_4 + 
    probe_5 + probe_6 + probe_7 + probe_8 + probe_9 + probe_10 + 
    probe_11 + probe_12 + probe_13 + probe_14 + probe_15, family = quasibinomial(link = "logit"), 
    data = cgshort, maxit = 100)

Deviance Residuals: 
       Min          1Q      Median          3Q         Max  
-6.227e-06  -3.066e-07   3.076e-06   4.536e-06   6.389e-06  

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)  23.68423    5.23171   4.527 0.001932 ** 
probe_1       0.41584    1.03539   0.402 0.698471    
probe_2      -0.88243    0.80631  -1.094 0.305630    
probe_3      -1.14642    0.60525  -1.894 0.094819 .  
probe_4       0.08650    1.64350   0.053 0.959314    
probe_5      -1.46564    1.38381  -1.059 0.320469    
probe_6      -0.72839    1.35910  -0.536 0.606580    
probe_7       2.59539    0.48714   5.328 0.000704 ***
probe_8       2.03890    1.43339   1.422 0.192700    
probe_9       0.87683    1.52469   0.575 0.581041    
probe_10      1.79828    0.80940   2.222 0.057028 .  
probe_11      0.66033    0.93300   0.708 0.499195    
probe_12    -14.75184    2.98871  -4.936 0.001141 ** 
probe_13      3.30891    1.31239   2.521 0.035737 *  
probe_14      0.36376    0.99582   0.365 0.724368    
probe_15     -0.03516    0.91771  -0.038 0.970375    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for quasibinomial family taken to be 6.771977e-11)

    Null deviance: 2.6992e+01  on 23  degrees of freedom
Residual deviance: 3.9885e-10  on  8  degrees of freedom
AIC: NA

Number of Fisher Scoring iterations: 24

PS: the reason I am using quasibinomial here is that within the "Tumor" samples there are 2 different stages. However, there is no statistically significant difference in methylation levels between them (this was checked in a previous analysis).

And finally, the ROC curve with AUC:

> roc.final <- roc(cgshort$Tumor, fitted(glmcgs), smooth=FALSE)

Call:
roc.default(response = cgshort$Tumor, predictor = fitted(glmcgs),     smooth = FALSE)

Data: fitted(glmcgs) in 6 controls (cgshort$Tumor No) < 18 cases (cgshort$Tumor Yes).
Area under the curve: 1

My guess is that the sample size is not big enough, which would also explain the high standard errors. Would that be it? And would there still be a way to evaluate how well these potential predictors perform with such a sample?

Any help is greatly appreciated. Thanks!

If the model manages to separate the successes and failures completely, the AUC will be 1. You have very little data and many predictors, and some of them are very effective at predicting the outcome, so it is no surprise that the model appears deterministic and the ROC curve is square.
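To illustrate (a toy sketch with made-up scores, not your data): any score that puts every case above every control gives AUC = 1, regardless of how the score was obtained. In-sample fitted values from a perfectly separating model behave exactly like this.

library(pROC)

# Hypothetical example: 6 controls and 18 cases, as in your data, but with
# random scores whose ranges do not overlap between the two groups.
set.seed(1)
y    <- c(rep("No", 6), rep("Yes", 18))
pred <- c(runif(6, 0.0, 0.4), runif(18, 0.6, 1.0))

roc(y, pred)$auc   # Area under the curve: 1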

Let me guess: you used the maxit parameter because the model estimation wouldn't converge. This means that the solution is not reliable. To get a reliable one you may use the generalized LASSO, or some other kind of regularization.
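For example, here is a minimal sketch using glmnet's ordinary lasso penalty (just one readily available form of regularization, not necessarily the one you used in your earlier lasso step), combined with leave-one-out predictions so the ROC is built from out-of-sample scores rather than from fitted() values. It assumes your full data frame is cgshort, with Tumor as a No/Yes factor and the probes in columns named probe_1 ... probe_15:

library(glmnet)
library(pROC)

x <- as.matrix(cgshort[, grep("^probe_", names(cgshort))])
y <- cgshort$Tumor

# Leave-one-out: refit the penalized logistic model without sample i, then
# predict sample i, so every score used for the ROC is out-of-sample.
set.seed(1)
loo_pred <- sapply(seq_len(nrow(x)), function(i) {
  fit <- cv.glmnet(x[-i, ], y[-i], family = "binomial", nfolds = 5)
  predict(fit, x[i, , drop = FALSE], s = "lambda.min", type = "response")
})

roc(y, loo_pred)   # usually well below the optimistic in-sample AUC of 1

With only 24 samples and 6 controls this estimate will still be very noisy, but at least it is not equal to 1 by construction, the way the in-sample AUC is.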

This is a statistics question by the way.
