
ROC curve in R using ROCR package

Can someone please explain to me how to plot a ROC curve with ROCR? I know that I should first run:

prediction(predictions, labels, label.ordering = NULL)

and then:

performance(prediction.obj, measure, x.measure="cutoff", ...)

I am just not clear on what is meant by predictions and labels. I created a model with ctree and cforest and I want the ROC curve for both of them so I can compare them in the end. In my case the class attribute is y_n, which I suppose should be used for the labels. But what about the predictions? Here are the steps of what I do (dataset name = bank_part):

pred<-cforest(y_n~.,bank_part)
tablebank<-table(predict(pred),bank_part$y_n)
prediction(tablebank, bank_part$y_n)

After running the last line I get this error:

Error in prediction(tablebank, bank_part$y_n) : 
Number of cross-validation runs must be equal for predictions and labels.

Thanks in advance!

Here's another example: I have a training dataset (bank_training) and a testing dataset (bank_testing), and I ran a randomForest as below:

bankrf <- randomForest(y ~ ., bank_training, mtry = 4, ntree = 2,
                       keep.forest = TRUE, importance = TRUE)
bankrf.pred <- predict(bankrf, bank_testing, type = 'response')

Now bankrf.pred is a factor object with levels c("0", "1"). Still, I don't know how to plot the ROC curve, because I get stuck at the prediction part. Here's what I do:

library(ROCR) 
pred<-prediction(bankrf.pred$y, bank_testing$c(0,1) 

But this is still incorrect, because I get the error message:

Error in bankrf.pred$y_n : $ operator is invalid for atomic vectors

The predictions are your continuous scores from the classifier; the labels are the binary truth for each observation.

So something like the following should work:

> pred <- prediction(c(0.1,.5,.3,.8,.9,.4,.9,.5), c(0,0,0,1,1,1,1,1))
> perf <- performance(pred, "tpr", "fpr")
> plot(perf)

to generate an ROC.
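
You can also overlay the chance diagonal as a reference (this line is my addition, not part of the original answer):

> abline(0, 1, lty = 2)  # a random classifier falls along this line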

EDIT: It may be helpful for you to include sample reproducible code in the question (I'm having a hard time interpreting your comment).

There's no new code here, but... here's a function I use quite often for plotting an ROC:

plotROC <- function(truth, predicted, ...){
  pred <- prediction(abs(predicted), truth)
  perf <- performance(pred, "tpr", "fpr")
  plot(perf, ...)
}
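
A quick toy call, with made-up scores and labels of my own, just to show the signature:

plotROC(truth = c(0, 0, 1, 0, 1, 1),
        predicted = c(0.2, 0.4, 0.7, 0.3, 0.9, 0.6),
        main = "Example ROC")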

Like @Jeff said, your predictions need to be continuous for ROCR's prediction function. require(randomForest); ?predict.randomForest shows that, by default, predict.randomForest returns a prediction on the original scale (class labels, in classification), whereas predict.randomForest(..., type = 'prob') returns probabilities of each class. So:

require(ROCR)
require(randomForest)
data(iris)
iris$setosa <- factor(1*(iris$Species == 'setosa'))
iris.rf <- randomForest(setosa ~ ., data=iris[,-5])
summary(predict(iris.rf, iris[,-5]))
summary(iris.preds <- predict(iris.rf, iris[,-5], type = 'prob'))
preds <- iris.preds[,2]
plot(performance(prediction(preds, iris$setosa), 'tpr', 'fpr'))

gives you what you want. Different classification packages require different commands for getting predicted probabilities -- sometimes it's predict(..., type='probs'), predict(..., type='prob')[,2], etc., so just check the help files for each function you're calling.
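
Since the question uses ctree and cforest from party, here is a hedged sketch of how extracting probabilities might look there; I'm assuming party's predict(..., type = "prob"), which returns a list of per-observation class-probability vectors, so check ?party::cforest for your version:

library(party)
library(ROCR)
fit <- cforest(y_n ~ ., data = bank_part)                      # model from the question
prob.list <- predict(fit, newdata = bank_part, type = "prob")  # list of probability vectors
p.second <- sapply(prob.list, function(p) p[2])                # probability of the second class
plot(performance(prediction(p.second, bank_part$y_n), "tpr", "fpr"))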

This is how you can do it:

Have your data in a CSV file ("data_file.csv"; you may need to give the full path here). The file should have column headers, which here I will call "default_flag", "var1", "var2", "var3", where default_flag is 0 or 1 and the other variables can have any value. R code:

rm(list=ls())
df <- read.csv("data_file.csv") #use the full path if needed
mylogit <- glm(default_flag ~  var1 + var2 + var3, family = "binomial" , data = df)

summary(mylogit)
library(ROCR)

df$score<-predict.glm(mylogit, type="response" )
pred<-prediction(df$score,df$default_flag)
perf<-performance(pred,"tpr", "fpr")
plot(perf)
auc<- performance(pred,"auc")
auc

Note that df$score will give you the probability of default. In case you want to use this logit (same regression coefficients) on another data set df2 for cross-validation, use:

df2 <- read.csv("data_file2.csv")

df2$score<-predict.glm(mylogit,newdata=df2, type="response" )

pred<-prediction(df2$score,df2$default_flag)
perf<-performance(pred,"tpr", "fpr")
plot(perf)
auc<- performance(pred,"auc")
auc
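
Printing auc shows the whole S4 performance object; to pull out the numeric AUC itself (this extraction step is my addition), read its y.values slot:

as.numeric(auc@y.values)  # the AUC as a plain number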

The problem is, as pointed out by others, that prediction in ROCR expects numerical values. If you are passing predictions from randomForest (as the first argument to prediction in ROCR), those predictions need to be generated with type='prob' instead of type='response', which is the default. Alternatively, you could take the type='response' results and convert them to numeric (that is, if your responses are, say, 0/1). But when you plot that, ROCR generates only a single meaningful point on the ROC curve. To get many points on your ROC curve, you really need the probability associated with each prediction - i.e. use type='prob' when generating predictions.
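
Applied to the question's bankrf model, a minimal sketch (assuming the second probability column corresponds to class "1" and that the true labels live in bank_testing$y) would be:

bankrf.prob <- predict(bankrf, bank_testing, type = 'prob')[, 2]  # P(y = "1") for each row
pred <- prediction(bankrf.prob, bank_testing$y)
plot(performance(pred, 'tpr', 'fpr'))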

The problem may be that you would like to run the prediction function on multiple runs, for example for cross-validation.

In this case, for the prediction(predictions, labels, label.ordering = NULL) function, the "predictions" and "labels" arguments should each be a list or a matrix.
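
ROCR ships a small 10-fold cross-validation example data set, ROCR.xval, whose predictions and labels components are exactly such lists, so you can see the list form in action:

library(ROCR)
data(ROCR.xval)                                              # 10-fold CV example data
pred <- prediction(ROCR.xval$predictions, ROCR.xval$labels)  # one list element per fold
perf <- performance(pred, "tpr", "fpr")
plot(perf, col = "grey")                                     # one ROC curve per fold
plot(perf, avg = "vertical", add = TRUE, lwd = 2)            # vertically averaged curve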

Try this one:

library(ROCR)
pred <- ROCR::prediction(as.numeric(bankrf.pred), bank_testing$y)  # factor coerced to numeric so ROCR accepts it

The function prediction is present in many packages. You should explicitly specify the namespace (ROCR::) to use the one in ROCR. This one worked for me.
