
ROC curve in R using ROCR package

Can someone please explain how to plot a ROC curve with ROCR? I know that I should first run:

prediction(predictions, labels, label.ordering = NULL)

and then:

performance(prediction.obj, measure, x.measure="cutoff", ...)

I am just not clear on what is meant by predictions and labels. I created models with ctree and cforest, and I want the ROC curve for both of them so I can compare them in the end. In my case the class attribute is y_n, which I suppose should be used for the labels. But what about the predictions? Here are the steps of what I do (dataset name = bank_part):

pred<-cforest(y_n~.,bank_part)
tablebank<-table(predict(pred),bank_part$y_n)
prediction(tablebank, bank_part$y_n)

After running the last line I get this error:

Error in prediction(tablebank, bank_part$y_n) : 
Number of cross-validation runs must be equal for predictions and labels.

Thanks in advance!

Here's another example: I have a training dataset (bank_training) and a testing dataset (bank_testing), and I ran a randomForest as below:

bankrf<-randomForest(y~., bank_training, mtry=4, ntree=2,    
keep.forest=TRUE,importance=TRUE) 
bankrf.pred<-predict(bankrf, bank_testing, type='response')

Now bankrf.pred is a factor object with levels c("0", "1"). Still, I don't know how to plot the ROC, because I get stuck at the prediction step. Here's what I do:

library(ROCR) 
pred<-prediction(bankrf.pred$y, bank_testing$c(0,1) 

But this is still incorrect, because I get the error message:

Error in bankrf.pred$y_n : $ operator is invalid for atomic vectors

The predictions are your continuous predictions of the classification; the labels are the binary truth for each observation.

So something like the following should work:

> pred <- prediction(c(0.1,.5,.3,.8,.9,.4,.9,.5), c(0,0,0,1,1,1,1,1))
> perf <- performance(pred, "tpr", "fpr")
> plot(perf)

to generate an ROC curve.
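To make the roles of the two arguments concrete, here is a minimal base-R sketch (no ROCR needed) of how a set of continuous scores and binary labels turns into ROC points. It reuses the toy numbers from the example above; the variable names (scores, labels, cutoffs, roc_points) are illustrative only, and this is a simplified picture of the idea rather than ROCR's actual implementation.

```r
# Toy scores and 0/1 labels taken from the example above
scores <- c(0.1, 0.5, 0.3, 0.8, 0.9, 0.4, 0.9, 0.5)
labels <- c(0, 0, 0, 1, 1, 1, 1, 1)

# Every distinct score is a candidate cutoff, from strictest to loosest;
# Inf stands for "call nothing positive"
cutoffs <- c(Inf, sort(unique(scores), decreasing = TRUE))

# At each cutoff, an observation is called positive when its score >= cutoff
tpr <- sapply(cutoffs, function(k) sum(scores >= k & labels == 1) / sum(labels == 1))
fpr <- sapply(cutoffs, function(k) sum(scores >= k & labels == 0) / sum(labels == 0))

# One row per cutoff: these (fpr, tpr) pairs are the points of the ROC curve
roc_points <- data.frame(cutoff = cutoffs, fpr = fpr, tpr = tpr)
print(roc_points)
```

Each distinct score contributes one point, which is why continuous predictions give a curve while hard class labels give very few points.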

EDIT: It may be helpful for you to include sample reproducible code in the question (I'm having a hard time interpreting your comment).

There's no new code here, but... here's a function I use quite often for plotting an ROC curve:

 plotROC <- function(truth, predicted, ...){
   pred <- prediction(abs(predicted), truth)    
   perf <- performance(pred,"tpr","fpr")

   plot(perf, ...)
}

Like @Jeff said, your predictions need to be continuous for ROCR's prediction function. require(randomForest); ?predict.randomForest shows that, by default, predict.randomForest returns a prediction on the original scale (class labels, in classification), whereas predict.randomForest(..., type = 'prob') returns probabilities of each class. So:

require(randomForest)
require(ROCR)
data(iris)
iris$setosa <- factor(1*(iris$Species == 'setosa'))
iris.rf <- randomForest(setosa ~ ., data=iris[,-5])
summary(predict(iris.rf, iris[,-5]))
summary(iris.preds <- predict(iris.rf, iris[,-5], type = 'prob'))
preds <- iris.preds[,2]
plot(performance(prediction(preds, iris$setosa), 'tpr', 'fpr'))

gives you what you want. Different classification packages require different commands for getting predicted probabilities: sometimes it's predict(..., type='probs'), sometimes predict(..., type='prob')[,2], etc., so just check the help files for each function you're calling.

This is how you can do it:

Have your data in a CSV file ("data_file.csv"); you may need to give the full path. The file should contain column headers; here I will use "default_flag", "var1", "var2", "var3", where default_flag is 0 or 1 and the other variables can take any value. R code:

rm(list=ls())
df <- read.csv("data_file.csv") #use the full path if needed
mylogit <- glm(default_flag ~  var1 + var2 + var3, family = "binomial" , data = df)

summary(mylogit)
library(ROCR)

df$score<-predict.glm(mylogit, type="response" )
pred<-prediction(df$score,df$default_flag)
perf<-performance(pred,"tpr", "fpr")
plot(perf)
auc<- performance(pred,"auc")
auc

Note that df$score gives you the probability of default. If you want to use this logit (same regression coefficients) to test on another data set df2 for cross-validation, use:

df2 <- read.csv("data_file2.csv")

df2$score<-predict.glm(mylogit,newdata=df2, type="response" )

pred<-prediction(df2$score,df2$default_flag)
perf<-performance(pred,"tpr", "fpr")
plot(perf)
auc<- performance(pred,"auc")
auc
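Printing the object returned by performance(pred, "auc") shows a whole S4 object rather than a single number. As a small sketch, the numeric AUC can be pulled out of its y.values slot. The data below is simulated purely for illustration (labels, scores, and the seed are assumptions, not part of the example above).

```r
library(ROCR)

set.seed(1)                      # simulated data, purely illustrative
labels <- rbinom(200, 1, 0.5)    # hypothetical 0/1 outcome
scores <- labels + rnorm(200)    # noisy continuous score, higher for positives

pred <- prediction(scores, labels)
auc_perf <- performance(pred, "auc")

# y.values is a list with one element per run; a single run gives one number
auc_value <- auc_perf@y.values[[1]]
auc_value
```

The same extraction should apply to the auc object in the glm example, i.e. auc@y.values[[1]].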

The problem, as pointed out by others, is that prediction in ROCR expects numerical values. If you are passing predictions from randomForest as the first argument to prediction in ROCR, they need to be generated with type='prob' instead of type='response', which is the default. Alternatively, you could take the type='response' results and convert them to numeric (if your responses are, say, 0/1). But when you plot that, ROCR generates only a single meaningful point on the ROC curve. To get many points on your ROC curve, you really need the probability associated with each prediction, i.e. use type='prob' when generating predictions.
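The difference between hard class labels and probabilities can be checked on toy data. The sketch below uses made-up vectors (labels, hard, soft are hypothetical) and counts how many points ROCR places on the curve in each case.

```r
library(ROCR)

labels <- c(0, 0, 0, 1, 1, 1, 1, 1)                   # true classes
hard   <- c(0, 1, 0, 1, 1, 0, 1, 1)                   # 0/1 class predictions
soft   <- c(0.1, 0.5, 0.3, 0.8, 0.9, 0.4, 0.9, 0.5)   # probability-like scores

perf_hard <- performance(prediction(hard, labels), "tpr", "fpr")
perf_soft <- performance(prediction(soft, labels), "tpr", "fpr")

# Number of points on each curve, including the (0,0) and (1,1) end points
n_hard <- length(perf_hard@x.values[[1]])
n_soft <- length(perf_soft@x.values[[1]])
c(hard = n_hard, soft = n_soft)
```

With only two distinct predicted values, the hard predictions leave just one point between the two corners, whereas the scores give one point per distinct value.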

The problem may be that you would like to run the prediction function on multiple runs, for example for cross-validation.

In this case, the "predictions" and "labels" arguments of prediction(predictions, labels, label.ordering = NULL) should each be a list or a matrix.
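As a rough sketch of the multi-run form, the example below builds two hypothetical cross-validation folds (fold_labels and fold_scores are simulated, not taken from the question) and passes them as lists with one element per run. Both lists must have the same number of elements, which is what the "Number of cross-validation runs must be equal" error refers to; passing a table as predictions alongside a plain vector of labels likely trips this check.

```r
library(ROCR)

set.seed(42)  # simulated folds, purely illustrative

# Two hypothetical folds, each with 25 negatives and 25 positives
fold_labels <- list(rep(c(0, 1), each = 25), rep(c(0, 1), each = 25))
fold_scores <- lapply(fold_labels, function(y) y + rnorm(length(y)))

# One list element per run, same number of runs in both arguments
pred_cv <- prediction(fold_scores, fold_labels)
perf_cv <- performance(pred_cv, "tpr", "fpr")

plot(perf_cv, col = "grey")                    # one ROC curve per fold
plot(perf_cv, avg = "vertical", add = TRUE)    # vertically averaged curve
```

For a single train/test split, plain vectors of equal length are enough.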

Try this one:

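library(ROCR)
# Assumes the label column is y (as in randomForest(y ~ ., ...)) and that
# bankrf.pred is a factor with levels "0"/"1", converted here to numeric
pred <- ROCR::prediction(as.numeric(as.character(bankrf.pred)), bank_testing$y)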

A function named prediction is present in many packages. You should explicitly specify the namespace (ROCR::) to use the one in ROCR. This worked for me.
