Confusion matrix for random forest in R caret

Question

I have data with a binary YES/NO Class response. I am using the following code to fit a random forest model, but I have a problem getting the confusion matrix.

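library(readxl)
library(caret)

dataR <- read_excel("*:/*.xlsx")

# Stratified 70/30 split on the response
Train    <- createDataPartition(dataR$Class, p = 0.7, list = FALSE)
training <- dataR[ Train, ]
testing  <- dataR[-Train, ]

# Random forest tuned over 3 values of mtry with 5-fold cross-validation
model_rf <- train(Class ~ ., data = training, method = "rf",
                  tuneLength = 3, importance = TRUE,
                  trControl = trainControl(method = "cv", number = 5))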

Results:

Random Forest 

3006 samples
82 predictor
2 classes: 'NO', 'YES' 

No pre-processing
Resampling: Cross-Validated (5 fold) 
Summary of sample sizes: 2405, 2406, 2405, 2404, 2404 
Addtional sampling using SMOTE

Resampling results across tuning parameters:

  mtry  Accuracy   Kappa
   2    0.7870921  0.2750655
  44    0.7787721  0.2419762
  87    0.7767760  0.2524898

Accuracy was used to select the optimal model using  the largest value.
The final value used for the model was mtry = 2.

So far so good, but when I run this code:

# Apply threshold of 0.50: p_class
class_log <- ifelse(model_rf[,1] > 0.50, "YES", "NO")

# Create confusion matrix
p <-confusionMatrix(class_log, testing[["Class"]])

##gives the accuracy
p$overall[1]

I get this error:

 Error in model_rf[, 1] : incorrect number of dimensions

I would appreciate any help getting the confusion matrix.

Answer 1

As I understand it, you would like to obtain the confusion matrix for the cross-validation in caret.

For this you need to specify savePredictions in trainControl. If it is set to "final", the predictions for the best model are saved. Specifying classProbs = TRUE also saves the probabilities for each class.

library(caret)

data(iris)
iris_2 <- iris[iris$Species != "setosa", ]   # make it a two-class problem
iris_2$Species <- factor(iris_2$Species)     # drop the unused level

model_rf <- train(Species ~ ., data = iris_2, method = "rf",
                  tuneLength = 3, importance = TRUE,
                  trControl = trainControl(method = "cv",
                                           number = 5,
                                           savePredictions = "final",
                                           classProbs = TRUE))

The predictions are in:

model_rf$pred

They are ordered by CV fold. To sort them as in the original data frame:

model_rf$pred$pred[order(model_rf$pred$rowIndex)]
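
The reordering step can be tried without fitting a model. The sketch below uses a small made-up data frame, cv_pred, as a stand-in for model_rf$pred (the values and the name are assumptions for illustration only); the pred, obs, rowIndex and Resample column names follow caret's saved predictions.

```r
# Hypothetical stand-in for model_rf$pred: rows are grouped by fold,
# and rowIndex points back to the row in the training data
cv_pred <- data.frame(
  pred     = c("b", "a", "a", "b", "a", "b"),
  obs      = c("b", "a", "b", "b", "a", "a"),
  rowIndex = c(2, 5, 3, 6, 1, 4),
  Resample = c("Fold1", "Fold1", "Fold2", "Fold2", "Fold3", "Fold3"),
  stringsAsFactors = FALSE
)

# Made-up response in the original row order
original_obs <- c("a", "b", "b", "a", "a", "b")

# order() on rowIndex restores the original row order
cv_sorted <- cv_pred[order(cv_pred$rowIndex), ]

# After sorting, the saved observations line up with the original response
print(identical(cv_sorted$obs, original_obs))
```

Selecting columns by name (pred, obs) rather than by position keeps the code independent of the column order in the saved predictions.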

To obtain a confusion matrix:

confusionMatrix(model_rf$pred$pred[order(model_rf$pred$rowIndex)], iris_2$Species)
#output
Confusion Matrix and Statistics

            Reference
Prediction   versicolor virginica
  versicolor         46         6
  virginica           4        44

               Accuracy : 0.9            
                 95% CI : (0.8238, 0.951)
    No Information Rate : 0.5            
    P-Value [Acc > NIR] : <2e-16         

                  Kappa : 0.8            
 Mcnemar's Test P-Value : 0.7518         

            Sensitivity : 0.9200         
            Specificity : 0.8800         
         Pos Pred Value : 0.8846         
         Neg Pred Value : 0.9167         
             Prevalence : 0.5000         
         Detection Rate : 0.4600         
   Detection Prevalence : 0.5200         
      Balanced Accuracy : 0.9000         

       'Positive' Class : versicolor

In a two-class setting, 0.5 is often a sub-optimal probability threshold. The optimal threshold can be found after training by optimizing Kappa or Youden's J statistic (or any other preferred metric) as a function of the probability threshold. Here is an example:

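# Saved CV predictions, sorted back into the original row order
probs <- model_rf$pred[order(model_rf$pred$rowIndex), ]

res <- do.call(rbind, lapply(1:40/40, function(x) {
  # Predict versicolor when its CV probability reaches the threshold x
  class <- factor(ifelse(probs$versicolor >= x, "versicolor", "virginica"),
                  levels = levels(iris_2$Species))
  mat <- confusionMatrix(class, iris_2$Species)
  data.frame(prob = x, kappa = unname(mat$overall["Kappa"]))
}))
res[which.max(res$kappa), ]   # threshold with the highest kappa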

Here the highest kappa is obtained not at a threshold of 0.5 but at 0.1. Use this carefully, because tuning the threshold can lead to overfitting.
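
The same search works for Youden's J (sensitivity + specificity - 1). Below is a minimal base-R sketch; the probabilities in prob_pos, the labels in truth, and the helper youden() are made up for illustration and do not come from the model above.

```r
# Hypothetical held-out probabilities of the positive class and the true labels
prob_pos <- c(0.93, 0.81, 0.72, 0.44, 0.57, 0.38, 0.29, 0.21, 0.12, 0.06)
truth    <- factor(c("pos", "pos", "pos", "pos", "neg",
                     "neg", "neg", "neg", "neg", "neg"),
                   levels = c("pos", "neg"))

# Youden's J at a given threshold
youden <- function(threshold) {
  pred <- factor(ifelse(prob_pos >= threshold, "pos", "neg"),
                 levels = levels(truth))
  m    <- table(pred, truth)
  sens <- m["pos", "pos"] / sum(m[, "pos"])   # true positive rate
  spec <- m["neg", "neg"] / sum(m[, "neg"])   # true negative rate
  sens + spec - 1
}

thresholds <- (1:19) / 20
j          <- sapply(thresholds, youden)
best       <- thresholds[which.max(j)]        # first threshold with the largest J

print(data.frame(threshold = thresholds, J = round(j, 3)))
print(best)
```

To limit overfitting, the threshold is better chosen on cross-validated or otherwise held-out probabilities than on the data used to report the final performance.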

Answer 2

You can try this to create the confusion matrix and check the accuracy:

m <- table(class_log, testing[["Class"]])
m   #confusion table

#Accuracy
(sum(diag(m)))/nrow(testing)
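
As a rough illustration of what the table gives you, the base-R sketch below computes accuracy and Cohen's kappa by hand. The observed and predicted vectors are made up; both are factors with the same levels so that the diagonal of the table holds the correct predictions.

```r
# Hypothetical observed and predicted classes, same levels in the same order
observed  <- factor(c("YES", "NO", "NO", "YES", "NO", "NO", "YES", "NO"),
                    levels = c("NO", "YES"))
predicted <- factor(c("YES", "NO", "YES", "YES", "NO", "NO", "NO", "NO"),
                    levels = c("NO", "YES"))

m <- table(Prediction = predicted, Reference = observed)

# Accuracy: share of cases on the diagonal
accuracy <- sum(diag(m)) / sum(m)

# Cohen's kappa: agreement beyond what the marginal totals give by chance
expected    <- sum(rowSums(m) * colSums(m)) / sum(m)^2
cohen_kappa <- (accuracy - expected) / (1 - expected)

print(m)
print(c(accuracy = accuracy, kappa = cohen_kappa))
```

If the two vectors are plain character vectors and one class is never predicted, the table is no longer square and diag() does not line up with the correct predictions, which is why the levels are fixed explicitly here.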

Answer 3

The code class_log <- ifelse(model_rf[,1] > 0.50, "YES", "NO") is a vectorized if-else that performs the following test on each element:

In the first column of model_rf, if the number is greater than 0.50, return "YES", else return "NO", and save the results in the object class_log.

So the code essentially creates a character vector of class labels, "YES" and "NO", based on a numeric vector.
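
A small base-R illustration of both halves of this, with made-up values: ifelse() works on a numeric vector, while a fitted model is a list-like object without columns, so matrix-style indexing such as [, 1] fails. Here fake_model is only a hypothetical stand-in for model_rf.

```r
# Hypothetical numeric vector of "YES" probabilities
prob_yes <- c(0.82, 0.35, 0.50, 0.61)

# ifelse() is vectorized: it returns one label per element
class_log <- ifelse(prob_yes > 0.50, "YES", "NO")

# Hypothetical stand-in for a fitted model: a list has no columns,
# so asking for its first column raises an error
fake_model <- list(method = "rf", results = data.frame(mtry = 2))
err_msg <- tryCatch(fake_model[, 1],
                    error = function(e) conditionMessage(e))

print(class_log)
print(err_msg)
```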

Answer 4

You need to apply your model to the test set.

prediction.rf <- predict(model_rf, testing, type = "prob")

Then do class_log <- ifelse(prediction.rf[, "YES"] > 0.50, "YES", "NO")
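
predict() with type = "prob" returns one probability column per class, so the threshold is applied to the "YES" column only. The sketch below uses a made-up data frame in place of the real predict() output, and a made-up observed vector in place of testing[["Class"]].

```r
# Hypothetical stand-in for predict(model_rf, testing, type = "prob")
prediction.rf <- data.frame(NO  = c(0.90, 0.40, 0.30, 0.75),
                            YES = c(0.10, 0.60, 0.70, 0.25))

# Hypothetical observed classes for the same four test rows
observed <- factor(c("NO", "YES", "NO", "NO"), levels = c("NO", "YES"))

# Threshold the "YES" probability and keep the same factor levels as the reference
class_log <- factor(ifelse(prediction.rf[["YES"]] > 0.50, "YES", "NO"),
                    levels = levels(observed))

m <- table(Prediction = class_log, Reference = observed)
print(m)
```

With caret loaded, confusionMatrix(class_log, observed, positive = "YES") accepts these two factors directly, because they share the same levels.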


4 answers

Answer 1
3, accepted, 2017-10-18 20:48:33

Answer 2
1, 2017-10-18 17:48:39

Answer 3
0, 2017-10-18 18:01:11

Answer 4
0, 2017-10-18 18:30:30
