How to interpret the accuracy of a model output using the caret package

I am using the caret package to train a model and would like to get the accuracy of the model. A common approach I have heard of is to use confusionMatrix(). However, when I run the code below, the trained model reports accuracy values that are slightly different from what confusionMatrix() reports. So my question is: which accuracy should I use, and how should I interpret the accuracy that the model prints directly in the console?

library(caret)
ModelRF_ALL_b <- train(price ~ ., method = "rf", data = datatraining_b)

The console reports the following:

Random Forest 

8143 samples
   8 predictor
   2 classes: '0', '1' 

No pre-processing
Resampling: Bootstrapped (25 reps) 
Summary of sample sizes: 8143, 8143, 8143, 8143, 8143, 8143, ... 
Resampling results across tuning parameters:

  mtry  Accuracy   Kappa    
  2     0.9948108  0.9843501
  4     0.9945824  0.9836512
  7     0.9940732  0.9821099

Accuracy was used to select the optimal model using the largest value.
The final value used for the model was mtry = 2.

I can also run confusionMatrix(). It gives an accuracy of 1:

Confusion Matrix and Statistics

Prediction    0    1
     0 6414    0
     1    0 1729

           Accuracy : 1          
             95% CI : (0.9995, 1)
No Information Rate : 0.7877     
P-Value [Acc > NIR] : < 2.2e-16  

              Kappa : 1          
 Mcnemar's Test P-Value : NA         

        Sensitivity : 1.0000     
        Specificity : 1.0000     
     Pos Pred Value : 1.0000     
     Neg Pred Value : 1.0000     
         Prevalence : 0.7877     
     Detection Rate : 0.7877     
   Detection Prevalence : 0.7877     
  Balanced Accuracy : 1.0000     

   'Positive' Class : 0     

Both values are computed from the training data, with and without resampling, respectively.

caret performs bootstrap resampling with 25 repetitions when you fit the model, as your model output shows (Resampling: Bootstrapped (25 reps)). The accuracy reported for each mtry is therefore an average over 25 resamples: in each one, a model is fitted to a bootstrap sample of 8143 observations and scored on the observations that were not drawn into that sample. To create the confusion matrix, you use the final model (the one with mtry = 2) to predict the outcomes of the same 8143 training observations it was fitted on. It is therefore normal for the two accuracies to differ slightly.
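The gap between the two numbers can be reproduced on a small simulated dataset. The sketch below is only an illustration: the data frame `toy` and its columns are made up and stand in for `datatraining_b`, `mtry` is fixed at 2 to keep the run short, and `method = "rf"` assumes the randomForest package is installed.

```r
library(caret)  # method = "rf" also needs the randomForest package

set.seed(1)

# Simulated two-class data (hypothetical stand-in for datatraining_b)
n <- 300
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
toy$y <- factor(ifelse(toy$x1 + toy$x2 + rnorm(n) > 0, "1", "0"))

# Same resampling scheme as in the question: 25 bootstrap resamples
fit <- train(y ~ ., data = toy, method = "rf",
             trControl = trainControl(method = "boot", number = 25),
             tuneGrid = data.frame(mtry = 2))

# Resampled estimate: one accuracy per bootstrap resample, each computed
# on the rows left out of that resample, then averaged
head(fit$resample)
resampled_acc <- mean(fit$resample$Accuracy)

# Resubstitution: predict the same rows the final model was fitted on
resub <- confusionMatrix(predict(fit, toy), toy$y)
resub_acc <- unname(resub$overall["Accuracy"])

c(resampled = resampled_acc, resubstitution = resub_acc)
```

On noisy data like this, the resubstitution accuracy is typically much closer to 1 than the resampled estimate, which is the same pattern as in the question, only more pronounced.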

You need to be careful when assessing goodness of fit, because you are training and evaluating your model on the same dataset. It is no surprise that you get a high accuracy. It is always good practice to evaluate your final model on an unseen dataset to check its performance and detect possible overfitting.
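One way to get such an unseen dataset is to hold out part of the data before training. The following is a minimal sketch under assumptions: `toy` is simulated data rather than the data from the question, the 80/20 split ratio is an arbitrary choice, and `method = "rf"` again assumes the randomForest package is installed.

```r
library(caret)  # method = "rf" also needs the randomForest package

set.seed(2)

# Simulated two-class data (hypothetical stand-in for the real dataset)
n <- 400
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
toy$y <- factor(ifelse(toy$x1 + toy$x2 + rnorm(n) > 0, "1", "0"))

# Stratified split: about 80% of rows for training, the rest held out
idx <- createDataPartition(toy$y, p = 0.8, list = FALSE)
train_set <- toy[idx, ]
test_set  <- toy[-idx, ]

# Fit on the training rows only
fit <- train(y ~ ., data = train_set, method = "rf",
             tuneGrid = data.frame(mtry = 2))

# Evaluate on rows the model has never seen
test_cm <- confusionMatrix(predict(fit, newdata = test_set), test_set$y)
test_acc <- unname(test_cm$overall["Accuracy"])
test_acc
```

If the held-out accuracy is close to the resampled accuracy printed by train(), that suggests the resampled estimate was a reasonable guide; a much lower held-out accuracy would point to overfitting.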
