How to plot a ROC curve from Classification Tree probabilities
I am attempting to plot a ROC curve from the class probabilities of a classification tree. However, when I plot the curve, it does not appear. I want to plot the ROC curve and then compute the AUC as the area under it. Does anyone know how to fix this? Thank you in advance.
The binary column Risk stands for misclassification risk, which I presume is my label. Should I be computing the ROC curve at a different point in my code?
Here are the plotting code and the data frame:
library(ROCR)
data(Risk.table)
pred = prediction(Risk.table$Predicted.prob, Risk.table2$Risk)
perf = performance(pred, measure="tpr", x.measure="fpr")
perf
plot(perf)
Predicted.prob Actual.prob predicted actual Risk
1 0.5384615 0.4615385 G8 V4 0
2 0.1212121 0.8787879 V4 V4 1
3 0.5384615 0.4615385 G8 G8 1
4 0.9000000 0.1000000 G8 G8 1
5 0.1212121 0.8787879 V4 V4 1
6 0.1212121 0.8787879 V4 V4 1
7 0.9000000 0.1000000 G8 G8 1
8 0.5384615 0.4615385 G8 V4 0
9 0.5384615 0.4615385 G8 V4 0
10 0.1212121 0.8787879 V4 G8 0
11 0.1212121 0.8787879 V4 V4 1
12 0.9000000 0.1000000 G8 V4 0
13 0.9000000 0.1000000 G8 V4 0
14 0.1212121 0.8787879 G8 V4 1
15 0.9000000 0.1000000 G8 G8 1
16 0.5384615 0.4615385 G8 V4 0
17 0.9000000 0.1000000 G8 V4 0
18 0.1212121 0.8787879 V4 V4 1
19 0.5384615 0.4615385 G8 V4 0
20 0.1212121 0.8787879 V4 V4 1
21 0.9000000 0.1000000 G8 G8 1
22 0.5384615 0.4615385 G8 V4 0
23 0.9000000 0.1000000 G8 V4 0
24 0.1212121 0.8787879 V4 V4 1
#Split data 70:30 after shuffling the data frame
index<-1:nrow(LDA.scores1)
trainindex.LDA3=sample(index, trunc(length(index)*0.70),replace=FALSE)
LDA.70.trainset3<-shuffle.cross.validation2[trainindex.LDA3,]
LDA.30.testset3<-shuffle.cross.validation2[-trainindex.LDA3,]
tree.split3<-rpart(Family~., data=LDA.70.trainset3, method="class")
tree.split3
summary(tree.split3)
print(tree.split3)
plot(tree.split3)
text(tree.split3,use.n=T,digits=0)
printcp(tree.split3)
tree.split3
res3=predict(tree.split3,newdata=LDA.30.testset3)
res4=as.data.frame(res3)
res4$predicted<-NA
res4$actual<-NA
for (i in 1:length(res4$G8)){
  if(res4$R2[i]>res4$V4[i]) {
    res4$predicted[i]<-"G8"
  }
  else {
    res4$predicted[i]<-"V4"
  }
  print(i)
}
res4
res4$actual<-LDA.30.testset3$Family
res4
Risk.table$Risk<-NA
Risk.table
for (i in 1:length(Risk.table$Risk)){
  if(Risk.table$predicted[i]==res4$actual[i]) {
    Risk.table$Risk[i]<-1
  }
  else {
    Risk.table$Risk[i]<-0
  }
  print(i)
}
#Confusion Matrix
cm=table(res4$actual, res4$predicted)
names(dimnames(cm))=c("actual", "predicted")
index<-1:nrow(significant.lda.Wilks2)
trainindex.LDA.help1=sample(index, trunc(length(index)*0.70), replace=FALSE)
sig.train=significant.lda.Wilks2[trainindex.LDA.help1,]
sig.test=significant.lda.Wilks2[-trainindex.LDA.help1,]
library(klaR)
nbmodel<-NaiveBayes(Family~., data=sig.train)
prediction<-predict(nbmodel, sig.test)
NB<-as.data.frame(prediction)
colnames(NB)<-c("Actual", "Predicted.prob", "acual.prob")
NB$actual2 = NA
NB$actual2[NB$Actual=="G8"] = 1
NB$actual2[NB$Actual=="V4"] = 0
NB2<-as.data.frame(NB)
plot(fit.perf, col="red"); #Naive Bayes
plot(perf, col="blue", add=T); #Classification Tree
abline(0,1,col="green")
library(caret)
library(e1071)
train_control<-trainControl(method="repeatedcv", number=10, repeats=3)
model<-train(Matriline~., data=LDA.scores, trControl=train_control, method="nb")
predictions <- predict(model, LDA.scores[,2:13])
confusionMatrix(predictions,LDA.scores$Family)
Confusion Matrix and Statistics
          Reference
Prediction V4 G8
        V4 25  2
        G8  5 48
               Accuracy : 0.9125
                 95% CI : (0.828, 0.9641)
    No Information Rate : 0.625
    P-Value [Acc > NIR] : 4.918e-09
                  Kappa : 0.8095
 Mcnemar's Test P-Value : 0.4497
            Sensitivity : 0.8333
            Specificity : 0.9600
         Pos Pred Value : 0.9259
         Neg Pred Value : 0.9057
             Prevalence : 0.3750
         Detection Rate : 0.3125
   Detection Prevalence : 0.3375
      Balanced Accuracy : 0.8967
       'Positive' Class : V4
I have several things to point out:
1) I think your formula has to be Family ~ . inside your rpart command.
2) In your initial table I can see the value W3 in your predicted column. Does that mean you don't have a binary dependent variable? ROC curves work with binary data, so check it.
3) The predicted and actual probabilities in your initial table always sum to 1. Is that reasonable? I think they represent something else, so you might consider renaming the columns in case they confuse you in the future.
4) I think you're confused about how ROC works and what inputs it needs. Your Risk column uses 1 to represent a correct prediction and 0 to represent a wrong prediction. However, the ROC curve needs 1 to represent one class and 0 to represent the other class. In simple words, the command is prediction(predictions, labels), where predictions are your predicted probabilities and labels are the true classes/levels of your dependent variable. Check the following code:
dt = read.table(text="
Id Predicted.prob Actual.prob predicted actual Risk
1 0.5384615 0.4615385 G8 V4 0
2 0.1212121 0.8787879 V4 V4 1
3 0.5384615 0.4615385 G8 G8 1
4 0.9000000 0.1000000 G8 G8 1
5 0.1212121 0.8787879 V4 V4 1
6 0.1212121 0.8787879 V4 V4 1
7 0.9000000 0.1000000 G8 G8 1
8 0.5384615 0.4615385 G8 V4 0
9 0.5384615 0.4615385 G8 V4 0
10 0.1212121 0.8787879 V4 G8 0
11 0.1212121 0.8787879 V4 V4 1
12 0.9000000 0.1000000 G8 V4 0
13 0.9000000 0.1000000 G8 V4 0
14 0.1212121 0.8787879 W3 V4 1
15 0.9000000 0.1000000 G8 G8 1
16 0.5384615 0.4615385 G8 V4 0
17 0.9000000 0.1000000 G8 V4 0
18 0.1212121 0.8787879 V4 V4 1
19 0.5384615 0.4615385 G8 V4 0
20 0.1212121 0.8787879 V4 V4 1
21 0.9000000 0.1000000 G8 G8 1
22 0.5384615 0.4615385 G8 V4 0
23 0.9000000 0.1000000 G8 V4 0
24 0.1212121 0.8787879 V4 V4 1", header=T)
library(ROCR)
roc_pred <- prediction(dt$Predicted.prob, dt$Risk)
perf <- performance(roc_pred, "tpr", "fpr")
plot(perf, col="red")
abline(0,1,col="grey")
This produces the ROC curve with Risk as the label.
When you instead create a new column actual2, with 1 in place of G8 and 0 in place of V4:
dt$actual2 = NA
dt$actual2[dt$actual=="G8"] = 1
dt$actual2[dt$actual=="V4"] = 0
roc_pred <- prediction(dt$Predicted.prob, dt$actual2)
perf <- performance(roc_pred, "tpr", "fpr")
plot(perf, col="red")
abline(0,1,col="grey")
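The question also asks for the AUC, which the code above does not compute. As a rough sketch, the AUC can be read from the same kind of prediction object with performance(..., measure = "auc"), and cross-checked in base R with the rank-based (Mann-Whitney) formula. The vectors prob and actual are retyped from the table above; the names label, auc_rank and auc_rocr are only illustrative, and the sketch assumes Predicted.prob is the probability of class G8.

```r
# Predicted probabilities and true classes, retyped from the table above.
# Assumption: Predicted.prob is the tree's probability for class G8.
prob <- c(0.5384615, 0.1212121, 0.5384615, 0.9, 0.1212121, 0.1212121,
          0.9, 0.5384615, 0.5384615, 0.1212121, 0.1212121, 0.9,
          0.9, 0.1212121, 0.9, 0.5384615, 0.9, 0.1212121,
          0.5384615, 0.1212121, 0.9, 0.5384615, 0.9, 0.1212121)
actual <- c("V4", "V4", "G8", "G8", "V4", "V4",
            "G8", "V4", "V4", "G8", "V4", "V4",
            "V4", "V4", "G8", "V4", "V4", "V4",
            "V4", "V4", "G8", "V4", "V4", "V4")

label <- as.integer(actual == "G8")   # 1 = G8 (positive class), 0 = V4

# Base-R cross-check: rank-based (Mann-Whitney) AUC; ties get average ranks.
n_pos <- sum(label == 1)
n_neg <- sum(label == 0)
auc_rank <- (sum(rank(prob)[label == 1]) - n_pos * (n_pos + 1) / 2) /
  (n_pos * n_neg)
print(auc_rank)

# The same quantity via ROCR, if the package is installed.
if (requireNamespace("ROCR", quietly = TRUE)) {
  roc_pred <- ROCR::prediction(prob, label)
  auc_rocr <- ROCR::performance(roc_pred, measure = "auc")@y.values[[1]]
  print(auc_rocr)
}
```

An AUC of 0.5 corresponds to the diagonal drawn by abline(0, 1); values closer to 1 mean the probabilities separate the two classes better.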
5) As @eipi10 mentioned above, you should try to get rid of the for loops in your code.
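A minimal sketch of what removing the loops could look like, assuming res4 holds one probability column per class (G8 and V4), which is what predict() returns by default for an rpart classification tree. The toy values and the column name correct are made up for illustration. The loop in the question compares res4$R2 with res4$V4; the sketch assumes G8 was intended.

```r
# Toy stand-in for as.data.frame(predict(tree, newdata = test_set)):
# one probability column per class (values invented for illustration).
res4 <- data.frame(G8 = c(0.90, 0.12, 0.54),
                   V4 = c(0.10, 0.88, 0.46))
actual <- c("G8", "V4", "V4")   # hypothetical true classes

# Vectorised replacement for the first loop: pick the more probable class.
res4$predicted <- ifelse(res4$G8 > res4$V4, "G8", "V4")
res4$actual <- actual

# Vectorised replacement for the second loop: 1 = correct, 0 = wrong.
res4$correct <- as.integer(res4$predicted == res4$actual)
print(res4)
```

Because ifelse() and == operate on whole columns at once, no index variable or print(i) is needed, and the result has one entry per row of the test set.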