為什么Rpart在R中比Caret rpart更准確

Question

這篇文章提到，由於自舉和交叉驗證，Caret rpart比rpart更准確：

為什么使用caret :: train（...，method =“ rpart”）的結果與rpart :: rpart（...）不同？

盡管當我比較這兩種方法時，Caret rpart的精度為0.4879，rpart的精度為0.7347（我在下面復制了代碼）。

除此之外，與rpart相比，插入符rpart的分類樹只有幾個節點（拆分）

有誰了解這些差異？

謝謝！

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```

## Loading libraries and the data

This is an R Markdown document. First we load the libraries and the data and split the trainingdata into a training and a testset.

```{r section1, echo=TRUE}

# load libraries
library(knitr)
library(caret)
suppressMessages(library(rattle))
library(rpart.plot)

# set the URL for the download
wwwTrain <- "http://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv"
wwwTest  <- "http://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv"

# download the datasets
training <- read.csv(url(wwwTrain))
testing  <- read.csv(url(wwwTest))

# create a partition with the training dataset 
inTrain  <- createDataPartition(training$classe, p=0.05, list=FALSE)
TrainSet <- training[inTrain, ]
TestSet  <- training[-inTrain, ]
dim(TrainSet)

# set seed for reproducibility        
set.seed(12345)

```
## Cleaning the data

```{r section2, echo=TRUE}
# remove variables with Nearly Zero Variance
NZV <- nearZeroVar(TrainSet)
TrainSet <- TrainSet[, -NZV]
TestSet  <- TestSet[, -NZV]
dim(TrainSet)
dim(TestSet)

# remove variables that are mostly NA
AllNA    <- sapply(TrainSet, function(x) mean(is.na(x))) > 0.95
TrainSet <- TrainSet[, AllNA==FALSE]
TestSet  <- TestSet[, AllNA==FALSE]
dim(TrainSet)
dim(TestSet)

# remove identification only variables (columns 1 to 5)
TrainSet <- TrainSet[, -(1:5)]
TestSet  <- TestSet[, -(1:5)]
dim(TrainSet)


```

## Prediction modelling

First we build a classification model using Caret with the rpart method:
```{r section4, echo=TRUE}

mod_rpart <- train(classe ~ ., method = "rpart", data = TrainSet)

pred_rpart <- predict(mod_rpart, TestSet)
confusionMatrix(pred_rpart, TestSet$classe)

mod_rpart$finalModel
fancyRpartPlot(mod_rpart$finalModel)

```

Second we build a similar model using rpart:
```{r section7, echo=TRUE}

# model fit
set.seed(12345)
modFitDecTree <- rpart(classe ~ ., data=TrainSet, method="class")
fancyRpartPlot(modFitDecTree)

# prediction on Test dataset
predictDecTree <- predict(modFitDecTree, newdata=TestSet, type="class")
confMatDecTree <- confusionMatrix(predictDecTree, TestSet$classe)
confMatDecTree

```

Answer 1

一個簡單的解釋是您沒有調整任何一個模型，並且在默認設置下，rpart的表現純屬偶然。

當您使用相同的參數時，您應該期望具有相同的性能。

讓我們用caret進行一些調整：

set.seed(1)
mod_rpart <- train(classe ~ .,
                   method = "rpart",
                   data = TrainSet,
                   tuneLength = 50, 
                   metric = "Accuracy",
                   trControl = trainControl(method = "repeatedcv",
                                            number = 4,
                                            repeats = 5,
                                            summaryFunction = multiClassSummary,
                                            classProbs = TRUE))

pred_rpart <- predict(mod_rpart, TestSet)
confusionMatrix(pred_rpart, TestSet$classe)
#output
Confusion Matrix and Statistics

          Reference
Prediction    A    B    C    D    E
         A 4359  243   92  135   38
         B  446 2489  299  161  276
         C  118  346 2477  300   92
         D  190  377  128 2240  368
         E  188  152  254  219 2652

Overall Statistics

               Accuracy : 0.7628          
                 95% CI : (0.7566, 0.7688)
    No Information Rate : 0.2844          
    P-Value [Acc > NIR] : < 2.2e-16       

                  Kappa : 0.7009          
 Mcnemar's Test P-Value : < 2.2e-16       

Statistics by Class:

                     Class: A Class: B Class: C Class: D Class: E
Sensitivity            0.8223   0.6900   0.7622   0.7332   0.7741
Specificity            0.9619   0.9214   0.9444   0.9318   0.9466
Pos Pred Value         0.8956   0.6780   0.7432   0.6782   0.7654
Neg Pred Value         0.9316   0.9253   0.9495   0.9469   0.9490
Prevalence             0.2844   0.1935   0.1744   0.1639   0.1838
Detection Rate         0.2339   0.1335   0.1329   0.1202   0.1423
Detection Prevalence   0.2611   0.1970   0.1788   0.1772   0.1859
Balanced Accuracy      0.8921   0.8057   0.8533   0.8325   0.8603

這比使用默認設置的rpart更好（ cp = 0.01 ）

如果我們將插入符號選擇的最佳cp設置為怎么樣？

modFitDecTree <- rpart(classe ~ .,
                       data = TrainSet,
                       method = "class",
                       control = rpart.control(cp = mod_rpart$bestTune))

predictDecTree <- predict(modFitDecTree, newdata = TestSet, type = "class" )
confusionMatrix(predictDecTree, TestSet$classe)
#part of ouput
Accuracy : 0.7628

為什么Rpart在R中比Caret rpart更准確

問題描述

1 個解決方案

解決方案1
3 已采納 2018-04-15 12:05:53

為什么Rpart在R中比Caret rpart更准確

問題描述

1 個解決方案

解決方案1 3 已采納 2018-04-15 12:05:53

解決方案1
3 已采納 2018-04-15 12:05:53