How to plot a ROC curve from Classification Tree probabilities

I am trying to plot a ROC curve from the class probabilities of a classification tree. However, when I plot it, the curve is absent. I want to plot the ROC curve and then compute the AUC from the area under it. Does anyone know how to fix this? Thanks in advance. The binary column Risk stands for the risk of misclassification, which I presume is my label. Should I be applying the ROC curve computation at a different point in my code?

Here is the code, followed by the data frame:

   library(ROCR)

   data(Risk.table)

   pred = prediction(Risk.table$Predicted.prob, Risk.table2$Risk)
   perf = performance(pred, measure="tpr", x.measure="fpr")
   perf
   plot(perf)

   Predicted.prob Actual.prob   predicted actual Risk
  1       0.5384615   0.4615385        G8     V4    0
  2       0.1212121   0.8787879        V4     V4    1
  3       0.5384615   0.4615385        G8     G8    1
  4       0.9000000   0.1000000        G8     G8    1
  5       0.1212121   0.8787879        V4     V4    1
  6       0.1212121   0.8787879        V4     V4    1
  7       0.9000000   0.1000000        G8     G8    1
  8       0.5384615   0.4615385        G8     V4    0
  9       0.5384615   0.4615385        G8     V4    0
  10      0.1212121   0.8787879        V4     G8    0
  11      0.1212121   0.8787879        V4     V4    1
  12      0.9000000   0.1000000        G8     V4    0
  13      0.9000000   0.1000000        G8     V4    0
  14      0.1212121   0.8787879        G8     V4    1
  15      0.9000000   0.1000000        G8     G8    1
  16      0.5384615   0.4615385        G8     V4    0
  17      0.9000000   0.1000000        G8     V4    0
  18      0.1212121   0.8787879        V4     V4    1
  19      0.5384615   0.4615385        G8     V4    0
  20      0.1212121   0.8787879        V4     V4    1
  21      0.9000000   0.1000000        G8     G8    1
  22      0.5384615   0.4615385        G8     V4    0
  23      0.9000000   0.1000000        G8     V4    0
  24      0.1212121   0.8787879        V4     V4    1

Here is the ROC plot this code outputs, but the curve is missing:

[image: ROC plot with no curve drawn]

I tried again, and this ROC curve is just wrong:

[image: ROC plot from the second attempt]

I constructed the above data frame using the code below.

The initial data frame containing all the data is called shuffle.cross.validation2.

  #Split data 70:30 after shuffling the data frame

  index<-1:nrow(LDA.scores1)
  trainindex.LDA3=sample(index, trunc(length(index)*0.70),replace=FALSE)      

  LDA.70.trainset3<-shuffle.cross.validation2[trainindex.LDA3,]

  LDA.30.testset3<-shuffle.cross.validation2[-trainindex.LDA3,]

Run a classification tree using the rpart package

 tree.split3<-rpart(Family~., data=LDA.70.trainset3, method="class")
 tree.split3
 summary(tree.split3)
 print(tree.split3)
 plot(tree.split3)
 text(tree.split3,use.n=T,digits=0)
 printcp(tree.split3)
 tree.split3

Predict the class probabilities for the test set

 res3=predict(tree.split3,newdata=LDA.30.testset3)
 res4=as.data.frame(res3)

Create two columns filled with NA (actual and predicted class)

 res4$predicted<-NA
 res4$actual<-NA


 for (i in 1:length(res4$G8)){

 if(res4$G8[i]>res4$V4[i]) {
 res4$predicted[i]<-"G8"
 }

 else {
 res4$predicted[i]<-"V4"
 }

  print(i)
 }

 res4

 res4$actual<-LDA.30.testset3$Family
 res4
 Risk.table$Risk<-NA
 Risk.table

Create the binary predictor column

  for (i in 1:length(Risk.table$Risk)){

  if(Risk.table$predicted[i]==res4$actual[i]) {
  Risk.table$Risk[i]<-1
  }

  else {
  Risk.table$Risk[i]<-0
  }

  print(i)
  }

Creation of the predicted and actual probabilities for the two families V4 and G8 above

    #Confusion Matrix

    cm=table(res4$actual, res4$predicted)

    names(dimnames(cm))=c("actual", "predicted")

Naive Bayes

  index<-1:nrow(significant.lda.Wilks2)
  trainindex.LDA.help1=sample(index, trunc(length(index)*0.70), replace=FALSE)                                     
  sig.train=significant.lda.Wilks2[trainindex.LDA.help1,]
  sig.test=significant.lda.Wilks2[-trainindex.LDA.help1,]


    library(klaR)
    nbmodel<-NaiveBayes(Family~., data=sig.train)
    prediction<-predict(nbmodel, sig.test)
    NB<-as.data.frame(prediction)
    colnames(NB)<-c("Actual", "Predicted.prob", "actual.prob")

    NB$actual2 = NA
    NB$actual2[NB$Actual=="G8"] = 1
    NB$actual2[NB$Actual=="V4"] = 0
    NB2<-as.data.frame(NB)

    plot(fit.perf, col="red"); #Naive Bayes
    plot(perf, col="blue", add=T); #Classification Tree
    abline(0,1,col="green")

[image: ROC curves for Naive Bayes (red) and the classification tree (blue)]

Original Naive Bayes code using the caret package

     library(caret)
     library(e1071)

  train_control<-trainControl(method="repeatedcv", number=10, repeats=3)
  model<-train(Matriline~., data=LDA.scores, trControl=train_control,    method="nb")
  predictions <- predict(model, LDA.scores[,2:13])
  confusionMatrix(predictions,LDA.scores$Family)

Results

               Confusion Matrix and Statistics

                        Reference
                Prediction V4 G8
                        V4 25  2
                        G8  5 48

                  Accuracy : 0.9125         
                    95% CI : (0.828, 0.9641)
       No Information Rate : 0.625          
       P-Value [Acc > NIR] : 4.918e-09      

                    Kappa : 0.8095         
   Mcnemar's Test P-Value : 0.4497         

              Sensitivity : 0.8333         
              Specificity : 0.9600         
           Pos Pred Value : 0.9259         
           Neg Pred Value : 0.9057         
               Prevalence : 0.3750         
           Detection Rate : 0.3125         
     Detection Prevalence : 0.3375         
        Balanced Accuracy : 0.8967         

         'Positive' Class : V4         

I have several things to point out:

1) I think your code has to be Family ~ . inside your rpart command.

2) In your initial table I can see a value W3 in your predicted column. Does that mean you don't have a binary dependent variable? ROC curves work with binary data, so check it.

3) Your predicted and actual probabilities in your initial table always sum to 1. Is that reasonable? I think they represent something else, so you might consider renaming them in case they confuse you in the future.

4) I think you're confused about how ROC works and what inputs it needs. Your Risk column uses 1 to represent a correct prediction and 0 to represent a wrong prediction. However, the ROC curve needs 1 to represent one class and 0 to represent the other class. In simple words, the command is prediction(predictions, labels), where predictions are your predicted probabilities and labels are the true classes/levels of your dependent variable. Check the following code:

dt = read.table(text="
Id Predicted.prob Actual.prob   predicted actual Risk
1       0.5384615   0.4615385        G8     V4    0
2       0.1212121   0.8787879        V4     V4    1
3       0.5384615   0.4615385        G8     G8    1
4       0.9000000   0.1000000        G8     G8    1
5       0.1212121   0.8787879        V4     V4    1
6       0.1212121   0.8787879        V4     V4    1
7       0.9000000   0.1000000        G8     G8    1
8       0.5384615   0.4615385        G8     V4    0
9       0.5384615   0.4615385        G8     V4    0
10      0.1212121   0.8787879        V4     G8    0
11      0.1212121   0.8787879        V4     V4    1
12      0.9000000   0.1000000        G8     V4    0
13      0.9000000   0.1000000        G8     V4    0
14      0.1212121   0.8787879        W3     V4    1
15      0.9000000   0.1000000        G8     G8    1
16      0.5384615   0.4615385        G8     V4    0
17      0.9000000   0.1000000        G8     V4    0
18      0.1212121   0.8787879        V4     V4    1
19      0.5384615   0.4615385        G8     V4    0
20      0.1212121   0.8787879        V4     V4    1
21      0.9000000   0.1000000        G8     G8    1
22      0.5384615   0.4615385        G8     V4    0
23      0.9000000   0.1000000        G8     V4    0
24      0.1212121   0.8787879        V4     V4    1", header=T)

library(ROCR)

roc_pred <- prediction(dt$Predicted.prob, dt$Risk)
perf <- performance(roc_pred, "tpr", "fpr")
plot(perf, col="red")
abline(0,1,col="grey")

The ROC curve is:

[image: ROC curve using Risk as the label]

If you instead create a new column actual2 that holds 1 for G8 and 0 for V4, and use it as the label:

dt$actual2 = NA
dt$actual2[dt$actual=="G8"] = 1
dt$actual2[dt$actual=="V4"] = 0

roc_pred <- prediction(dt$Predicted.prob, dt$actual2)
perf <- performance(roc_pred, "tpr", "fpr")
plot(perf, col="red")
abline(0,1,col="grey")

[image: ROC curve using actual2 as the label]
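The question also asks for the AUC. A minimal sketch of how it could be obtained with ROCR, assuming that Predicted.prob is the tree's probability for class G8 (the table does not say so explicitly). The vectors prob_G8 and labels are illustrative names; they hold the same score/class pairs as the table, regrouped by score for brevity.

```r
library(ROCR)

# Score/class pairs transcribed from the table above, grouped by score.
# Row order does not affect the ROC curve or the AUC.
prob_G8 <- c(rep(0.9000000, 8), rep(0.5384615, 7), rep(0.1212121, 9))
actual  <- c(rep(c("G8", "V4"), times = c(4, 4)),   # rows scored 0.90
             rep(c("G8", "V4"), times = c(1, 6)),   # rows scored 0.54
             rep(c("G8", "V4"), times = c(1, 8)))   # rows scored 0.12

# Labels must encode the true class (1 = G8, 0 = V4),
# not whether the prediction was correct.
labels <- as.integer(actual == "G8")

roc_pred <- prediction(prob_G8, labels)

# performance() with measure = "auc" returns an S4 object;
# the value itself is stored in the y.values slot.
auc <- performance(roc_pred, measure = "auc")@y.values[[1]]
auc  # roughly 0.73 for these 24 rows
```

Because the tree produces only three distinct probabilities, the curve has only three steps, so the AUC is a coarse summary here.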

5) As @eipi10 mentioned above, you should try to get rid of the for loops in your code.
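As a rough illustration of that last point, the two loops in the question could be replaced by vectorized comparisons. This is only a sketch: the small res4 data frame and the actual vector below are made-up stand-ins for the output of predict() and for LDA.30.testset3$Family, and the column name correct is hypothetical.

```r
# Hypothetical stand-in for as.data.frame(predict(tree, newdata = test)):
# one probability column per class.
res4 <- data.frame(G8 = c(0.90, 0.12, 0.54, 0.12),
                   V4 = c(0.10, 0.88, 0.46, 0.88))

# Hypothetical stand-in for the true classes of the test rows.
actual <- c("G8", "V4", "V4", "G8")

# Predicted class: whichever class has the larger probability.
res4$predicted <- ifelse(res4$G8 > res4$V4, "G8", "V4")
res4$actual <- actual

# 1 where the prediction was correct, 0 otherwise
# (what the Risk column in the question encodes).
res4$correct <- as.integer(res4$predicted == res4$actual)

# 1 for G8, 0 for V4: the kind of label a ROC curve needs.
res4$actual2 <- as.integer(res4$actual == "G8")

res4
```

Each comparison works on whole columns at once, so no index variable, no NA pre-allocation, and no print(i) are needed.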
