
How to deal with multiple class ROC analysis in R (pROC package)?

I am using the multiclass.roc function in R (pROC package). For instance, I trained a random forest on a data set; here is my code:

# randomForest & pROC packages should be installed:
# install.packages(c('randomForest', 'pROC'))
data(iris)
library(randomForest)
library(pROC)
set.seed(1000)
# 3-class in response variable
rf = randomForest(Species~., data = iris, ntree = 100)
# predict(.., type = 'prob') returns a probability matrix
multiclass.roc(iris$Species, predict(rf, iris, type = 'prob'))

And the result is:

Call:
multiclass.roc.default(response = iris$Species, predictor = predict(rf,     
iris, type = "prob"))
Data: predict(rf, iris, type = "prob") with 3 levels of iris$Species: setosa,   
versicolor, virginica.
Multi-class area under the curve: 0.5142

Is this right? Thanks!

"pROC" reference: http://www.inside-r.org/packages/cran/pROC/docs/multiclass.roc

As you saw in the reference, multiclass.roc expects a "numeric vector (...)", and the documentation of roc that is linked from there (for some reason not in the link you provided) further says "of the same length than response". You are passing a numeric matrix with 3 columns, which is clearly wrong and has not been supported since pROC 1.6. I have no idea what it was doing before, but probably not what you were expecting.
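The requirement above (a plain numeric vector, as long as the response) can be checked before calling multiclass.roc. The following is a minimal sketch in base R; `is_valid_predictor` is a hypothetical helper written for illustration, not a pROC function, and `prob_like` is a made-up stand-in for the 150 x 3 probability matrix from the question.

```r
# Hypothetical helper (not part of pROC): TRUE only for a plain numeric
# vector with one value per observation in `response`.
is_valid_predictor <- function(response, predictor) {
  is.atomic(predictor) && is.numeric(predictor) &&
    is.null(dim(predictor)) && length(predictor) == length(response)
}

data(iris)

# Stand-in for predict(rf, iris, type = "prob"): a 150 x 3 numeric matrix.
prob_like <- matrix(1 / 3, nrow = nrow(iris), ncol = 3)

matrix_ok <- is_valid_predictor(iris$Species, prob_like)          # matrix: rejected
vector_ok <- is_valid_predictor(iris$Species, iris$Petal.Length)  # vector: accepted
print(c(matrix = matrix_ok, vector = vector_ok))
```

A matrix fails the check because it carries a `dim` attribute, even though its mode is numeric.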

This means you must summarize your predictions in a single atomic vector of numeric mode. For your model, you could use the following, although converting a factor into a numeric generally doesn't make sense:

predictions <- as.numeric(predict(rf, iris, type = 'response'))
multiclass.roc(iris$Species, predictions)

What this code really does is compute 3 ROC curves on your predictions (setosa vs. versicolor, versicolor vs. virginica, and setosa vs. virginica) and average their AUCs.
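The pairwise averaging described above can be sketched by hand in base R. This is an illustration under stated assumptions, not pROC's implementation: `pair_auc` is a hypothetical helper that uses the rank-based (Mann-Whitney) formulation of the AUC, and `Petal.Length` stands in for a single numeric prediction so that the example needs no fitted model. pROC may also choose the comparison direction per pair, so its numbers need not match these.

```r
data(iris)
response <- iris$Species
score <- iris$Petal.Length  # assumption: stand-in for a numeric model output

# AUC for separating pair[1] (controls) from pair[2] (cases), computed from
# the rank sum of the cases (Mann-Whitney U divided by the number of pairs).
pair_auc <- function(response, score, pair) {
  neg <- score[response == pair[1]]
  pos <- score[response == pair[2]]
  r <- rank(c(neg, pos))                    # midranks handle ties
  r_pos <- r[length(neg) + seq_along(pos)]  # ranks belonging to the cases
  u <- sum(r_pos) - length(pos) * (length(pos) + 1) / 2
  u / (length(neg) * length(pos))
}

# One AUC per pair of classes, then the plain average.
pairs <- combn(levels(response), 2, simplify = FALSE)
pair_aucs <- sapply(pairs, function(p) pair_auc(response, score, p))
names(pair_aucs) <- sapply(pairs, paste, collapse = " vs. ")

print(pair_aucs)
mean_auc <- mean(pair_aucs)
print(mean_auc)
```

Printing the per-pair values next to the mean shows how much a single average can hide: a weak pair is easily masked by two well-separated ones.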

Three more comments:

  • I say converting a factor to numeric doesn't make sense because, unless your classification is perfect, you'll get different results if you reorder the levels. This is why pROC doesn't do it automatically: you must think about it in your setup.
  • In general, this multiclass averaging doesn't really make sense, and you're better off re-thinking your question in terms of binary classification. There are more advanced multiclass methods (with a ROC surface etc.) that aren't implemented yet in pROC.
  • As was stated by @cbeleites, it is not correct to evaluate a model with its training data (resubstitution), so in a real example you must set a test set aside or use cross-validation.
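The first comment, about level order, can be illustrated with a toy example in base R. All data here are made up for the illustration, and `auc_pair` is a hypothetical helper that computes the AUC as the fraction of (case, control) pairs ranked correctly, counting ties as one half.

```r
# Toy data: true classes and imperfect predicted classes.
truth <- factor(c("a", "a", "b", "b", "c", "c"))
pred  <- factor(c("a", "c", "b", "b", "c", "a"), levels = c("a", "b", "c"))

# Same predictions, different level order.
pred_reordered <- factor(pred, levels = c("b", "a", "c"))

# as.numeric() returns the level codes, so the numbers depend on the order.
codes_original  <- as.numeric(pred)
codes_reordered <- as.numeric(pred_reordered)

# Hypothetical helper: share of (case, control) pairs where the case scores
# higher than the control, with ties counted as 0.5.
auc_pair <- function(truth, score, neg_label, pos_label) {
  neg <- score[truth == neg_label]
  pos <- score[truth == pos_label]
  mean(outer(pos, neg, ">") + 0.5 * outer(pos, neg, "=="))
}

auc_original  <- auc_pair(truth, codes_original,  "a", "b")
auc_reordered <- auc_pair(truth, codes_reordered, "a", "b")
print(c(original = auc_original, reordered = auc_reordered))
```

The predictions are identical in both cases; only the arbitrary numeric coding changes, and the AUC for the a vs. b pair changes with it.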

Assuming that you did the resubstitution estimate only for the sake of a minimal working example, your code looks good to me.

I quickly tried to get an out-of-bag (OOB) prediction with type "prob" but didn't succeed. Thus, you'll need to do the validation outside the randomForest function.
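One way to validate outside randomForest is a simple hold-out split. The sketch below is an assumption-laden illustration, not the original poster's method: the 35/15 stratified split is an arbitrary choice, and the one-vs-rest AUCs are just one possible binary framing of the three-class problem. It requires the randomForest and pROC packages.

```r
library(randomForest)
library(pROC)

data(iris)
set.seed(1000)

# Stratified hold-out split: 35 rows per species for training, 15 for testing.
train_idx <- unlist(lapply(split(seq_len(nrow(iris)), iris$Species),
                           function(idx) sample(idx, 35)))
train <- iris[train_idx, ]
test  <- iris[-train_idx, ]

# Fit on the training rows only, then predict class probabilities for the
# held-out rows (one column per species).
rf <- randomForest(Species ~ ., data = train, ntree = 100)
prob <- predict(rf, test, type = "prob")

# One-vs-rest AUC per class: each call to roc() gets a binary response and a
# single numeric vector, with the direction fixed so cases score higher.
ovr_auc <- sapply(levels(test$Species), function(cls) {
  r <- roc(response = as.integer(test$Species == cls),
           predictor = prob[, cls],
           levels = c(0, 1), direction = "<")
  as.numeric(auc(r))
})
print(ovr_auc)
```

Because the test rows never reach the fitting step, these AUCs are not inflated by resubstitution; repeating the split or using cross-validation would give a more stable estimate.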

Personally, I wouldn't try to summarize a whole multiclass model in one unconditional number. But that's an entirely different question.

I copied your code and got an AUC of 0.83. Not sure what is different.

You are right, the s100b column is not a probability. The aSAH (aneurysmal subarachnoid hemorrhage) data set is a clinical data set. s100b is a protein found in glial cells in the brain. According to the research paper on the data set, the s100b column seems to represent the concentration of the s100b protein (µg/l), likely in a blood sample.
