ROC in rfe() in caret package for R
I am using the caret package in R to train a radial basis SVM for classification; in addition, a linear SVM is used for variable selection. With metric="Accuracy" this works fine, but ultimately I am more interested in optimizing metric="ROC". While the ROC is calculated for all models that are fit, there seems to be a problem with aggregating the ROC values.
The following is some example code:
library(caret)
library(kernlab)  # ksvm() and vanilladot are called directly below
library(mlbench)

set.seed(0)
data(Sonar)
x <- scale(Sonar[, 1:60])
y <- as.factor(Sonar[, 61])

# Custom summary function that combines defaultSummary() and
# twoClassSummary(). The input and output of the summary function
# are also printed.
svm.summary <- function(data, lev = NULL, model = NULL) {
  print(head(data, n = 3))
  a <- defaultSummary(data, lev, model)
  b <- twoClassSummary(data, lev, model)
  out <- c(a, b)
  print(out)
  out
}

fitControl <- trainControl(
  method = "cv",
  number = 2,
  classProbs = TRUE,
  summaryFunction = svm.summary,
  verbose = TRUE,
  allowParallel = FALSE)

# Ranking function: rank variables using a linear SVM
rankSVM <- function(object, x, y) {
  print("ranking")
  obj <- ksvm(x = as.matrix(x), y = y,
              kernel = vanilladot,
              kpar = list(), C = 10,
              scaled = FALSE)
  # Weight vector of the linear SVM, normalised to unit length
  w <- t(obj@coef[[1]] %*% obj@xmatrix[[1]])
  z <- abs(w) / sqrt(sum(w^2))
  ord <- order(z, decreasing = TRUE)
  data.frame(var = dimnames(z)[[1]][ord], Overall = z[ord])
}

svmFuncs <- getModelInfo("svmRadial", regex = FALSE)

svmFit <- function(x, y, first, last, ...) {
  out <- train(x = x, y = as.factor(y),
               method = "svmRadial",
               trControl = fitControl,
               scaled = FALSE,
               metric = "Accuracy",
               maximize = TRUE,
               returnData = TRUE)
  out$finalModel
}

selectionFunctions <- list(summary = svm.summary,
                           fit = svmFit,
                           pred = svmFuncs$svmRadial$predict,
                           prob = svmFuncs$svmRadial$prob,
                           rank = rankSVM,
                           selectSize = pickSizeBest,
                           selectVar = pickVars)

selectionControl <- rfeControl(functions = selectionFunctions,
                               rerank = FALSE,
                               verbose = TRUE,
                               method = "cv",
                               number = 2)

subsets <- c(1, 30, 60)
svmProfile <- rfe(x = x, y = y,
                  sizes = subsets,
                  metric = "Accuracy",
                  maximize = TRUE,
                  rfeControl = selectionControl)
svmProfile
The final output is the following:
> svmProfile
Recursive feature selection
Outer resampling method: Cross-Validated (2 fold)
Resampling performance over subset size:
Variables Accuracy  Kappa ROC   Sens   Spec AccuracySD KappaSD ROCSD  SensSD SpecSD Selected
        1   0.8075 0.6122 NaN 0.8292 0.7825    0.02981 0.06505    NA 0.06153 0.1344        *
       30   0.8028 0.6033 NaN 0.8205 0.7825    0.00948 0.02533    NA 0.09964 0.1344
       60   0.8028 0.6032 NaN 0.8206 0.7823    0.00948 0.02679    NA 0.12512 0.1635
The top 1 variables (out of 1):
V49
ROC is NaN. Inspecting the output (verbose=T is set, and the summary function was patched to print both its output and part of its input) reveals that, when tuning the SVMs in the inner loop, ROC seems to be calculated correctly:
+ Fold1: sigma=0.01172, C=0.25
pred obs M R
1 M R 0.6658878 0.3341122
2 M R 0.5679477 0.4320523
3 R R 0.2263576 0.7736424
Accuracy Kappa ROC Sens Spec
0.6730769 0.3480826 0.7961310 0.6428571 0.7083333
- Fold1: sigma=0.01172, C=0.25
+ Fold1: sigma=0.01172, C=0.50
pred obs M R
1 M R 0.7841249 0.2158751
2 M R 0.7231365 0.2768635
3 R R 0.3033492 0.6966508
Accuracy Kappa ROC Sens Spec
0.7692308 0.5214724 0.8407738 0.9642857 0.5416667
- Fold1: sigma=0.01172, C=0.50
[...]
However, there seems to be a problem in the outer iteration. Between two folds we get the following:
-(rfe) fit Fold1 size: 1
pred obs Variables
1 M R 1
2 M R 1
3 M R 1
Accuracy Kappa ROC Sens Spec
0.7864078 0.5662328 NA 0.8727273 0.6875000
pred obs Variables
1 R R 30
2 M R 30
3 M R 30
Accuracy Kappa ROC Sens Spec
0.7961165 0.5853939 NA 0.8909091 0.6875000
pred obs Variables
1 R R 60
2 M R 60
3 M R 60
Accuracy Kappa ROC Sens Spec
0.7961165 0.5842783 NA 0.9090909 0.6666667
+(rfe) fit Fold2 size: 60
So here it seems that the input to the summary function is a matrix that contains the number of variables instead of the class probabilities, so the ROC values cannot be calculated or aggregated correctly. Does anybody know how to prevent this? Did I forget to tell caret to output class probabilities somewhere?
Help is greatly appreciated, as caret is a really cool package to use and would save me plenty of work if I can get this to run correctly.
Thoralf
getModelInfo is designed to get code for train and doesn't automatically work with rfe (I'll make a note of that in the documentation). rfe doesn't look for a slot called probs, and no probability predictions means no ROC summary.
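To illustrate why probability predictions are needed, here is a minimal base R sketch that computes the area under the ROC curve from a continuous score using the rank-sum identity. The data values and the helper name auc_from_scores are made up for illustration; this is not how caret computes ROC internally (twoClassSummary relies on the pROC package).

```r
# Toy data (made up for illustration): observed classes and the predicted
# probability of class "M" for each sample.
obs    <- factor(c("M", "M", "R", "R", "M", "R"), levels = c("M", "R"))
prob_M <- c(0.90, 0.60, 0.40, 0.20, 0.35, 0.70)

# Hypothetical helper: area under the ROC curve via the rank-sum
# (Mann-Whitney) identity. It needs a continuous score per sample.
auc_from_scores <- function(scores, obs, positive) {
  is_pos <- obs == positive
  n_pos  <- sum(is_pos)
  n_neg  <- sum(!is_pos)
  ranks  <- rank(scores)  # tied scores receive average ranks
  (sum(ranks[is_pos]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

auc <- auc_from_scores(prob_M, obs, positive = "M")
print(auc)
```

If the data handed to the summary function only has pred and obs columns, there is no such score to rank, which is consistent with the NA values seen in the outer loop.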
You might want to base your code on caretFuncs, which is designed to work with rfe and should automate a lot of what I think you would like to do.
For example, in caretFuncs, the pred module will create class and probability predictions:
function(object, x) {
  tmp <- predict(object, x)
  if(object$modelType == "Classification" &
     !is.null(object$modelInfo$prob)) {
    out <- cbind(data.frame(pred = tmp),
                 as.data.frame(predict(object, x, type = "prob")))
  } else out <- tmp
  out
}
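As a quick check of what this module returns, the following sketch fits a small svmRadial model with train and calls caretFuncs$pred on a few rows. The settings (2-fold CV, tuneLength = 1) are arbitrary choices to keep it fast, and the exact body of caretFuncs$pred may differ between caret versions.

```r
library(caret)
library(mlbench)

set.seed(0)
data(Sonar)

# Small model fit; classProbs = TRUE so that probability predictions
# are available from the resulting train object.
fit <- train(x = Sonar[, 1:60], y = Sonar$Class,
             method = "svmRadial",
             tuneLength = 1,
             trControl = trainControl(method = "cv", number = 2,
                                      classProbs = TRUE))

# The pred module returns the predicted class plus one probability
# column per class level.
out <- caretFuncs$pred(fit, Sonar[1:3, 1:60])
print(names(out))
```

These per-class probability columns are what twoClassSummary needs in order to compute ROC.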
You might be able to simply plug your rankSVM into caretFuncs$rank.
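A minimal sketch of that approach, assuming the Sonar setup from the question, might look like the following. The names svmRFE, innerControl and outerControl are illustrative, and the custom fit wrapper is an assumption made here so that the inner train call also optimizes ROC; it has not been checked against every caret version.

```r
library(caret)
library(kernlab)
library(mlbench)

set.seed(0)
data(Sonar)
x <- scale(Sonar[, 1:60])
y <- Sonar$Class

# Ranking function from the question: rank variables by the absolute
# weights of a linear SVM.
rankSVM <- function(object, x, y) {
  obj <- ksvm(x = as.matrix(x), y = y, kernel = "vanilladot",
              kpar = list(), C = 10, scaled = FALSE)
  w <- t(obj@coef[[1]] %*% obj@xmatrix[[1]])
  z <- abs(w) / sqrt(sum(w^2))
  ord <- order(z, decreasing = TRUE)
  data.frame(var = dimnames(z)[[1]][ord], Overall = z[ord])
}

# Inner resampling used by train() to tune the radial SVM on ROC.
innerControl <- trainControl(method = "cv", number = 2,
                             classProbs = TRUE,
                             summaryFunction = twoClassSummary)

# Start from caretFuncs and replace only the pieces that need to differ.
svmRFE <- caretFuncs
svmRFE$summary <- twoClassSummary
svmRFE$rank <- rankSVM
svmRFE$fit <- function(x, y, first, last, ...) {
  # Return the whole train object (not finalModel) so that
  # caretFuncs$pred can produce class and probability predictions.
  train(x = x, y = y, method = "svmRadial", metric = "ROC",
        trControl = innerControl, ...)
}

outerControl <- rfeControl(functions = svmRFE, method = "cv",
                           number = 2, rerank = FALSE)

svmProfile <- rfe(x = x, y = y, sizes = c(1, 30, 60),
                  metric = "ROC", maximize = TRUE,
                  rfeControl = outerControl)
print(svmProfile$results[, c("Variables", "ROC", "Sens", "Spec")])
```

The key difference from the code in the question is that fit returns the train object itself, so the pred module from caretFuncs can add the probability columns that the summary function needs.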
Take a look at the feature selection page on the website. It has details about what code modules you will need.