Why is rpart more accurate than Caret rpart in R
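This post says that caret's rpart is more accurate than plain rpart because of bootstrapping and cross-validation:
Why do results using caret::train(..., method = "rpart") differ from rpart::rpart(...)?
However, when I compare the two methods I get an accuracy of 0.4879 for caret rpart and 0.7347 for rpart (my code is copied below).
In addition, the classification tree from caret rpart has only a few nodes (splits) compared with the one from rpart.
Does anyone understand these differences?
Thank you!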
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
## Loading libraries and the data
This is an R Markdown document. First we load the libraries and the data, then split the training data into a training set and a test set.
```{r section1, echo=TRUE}
# load libraries
library(knitr)
library(caret)
suppressMessages(library(rattle))
library(rpart.plot)
# set the URL for the download
wwwTrain <- "http://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv"
wwwTest <- "http://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv"
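# download the datasets
# (stringsAsFactors = TRUE keeps classe a factor on R >= 4.0, where the read.csv default changed)
training <- read.csv(url(wwwTrain), stringsAsFactors = TRUE)
testing <- read.csv(url(wwwTest), stringsAsFactors = TRUE)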
# set seed for reproducibility (must come before the random partition)
set.seed(12345)
# create a partition with the training dataset
# (p = 0.05 keeps only 5% of the rows for training)
inTrain <- createDataPartition(training$classe, p=0.05, list=FALSE)
TrainSet <- training[inTrain, ]
TestSet <- training[-inTrain, ]
dim(TrainSet)
```
## Cleaning the data
```{r section2, echo=TRUE}
# remove variables with near-zero variance
NZV <- nearZeroVar(TrainSet)
TrainSet <- TrainSet[, -NZV]
TestSet <- TestSet[, -NZV]
dim(TrainSet)
dim(TestSet)
# remove variables that are mostly NA
AllNA <- sapply(TrainSet, function(x) mean(is.na(x))) > 0.95
TrainSet <- TrainSet[, AllNA==FALSE]
TestSet <- TestSet[, AllNA==FALSE]
dim(TrainSet)
dim(TestSet)
# remove identification only variables (columns 1 to 5)
TrainSet <- TrainSet[, -(1:5)]
TestSet <- TestSet[, -(1:5)]
dim(TrainSet)
```
## Prediction modelling
First we build a classification model using Caret with the rpart method:
```{r section4, echo=TRUE}
mod_rpart <- train(classe ~ ., method = "rpart", data = TrainSet)
pred_rpart <- predict(mod_rpart, TestSet)
confusionMatrix(pred_rpart, TestSet$classe)
mod_rpart$finalModel
fancyRpartPlot(mod_rpart$finalModel)
```
Second we build a similar model using rpart:
```{r section7, echo=TRUE}
# model fit
set.seed(12345)
modFitDecTree <- rpart(classe ~ ., data=TrainSet, method="class")
fancyRpartPlot(modFitDecTree)
# prediction on Test dataset
predictDecTree <- predict(modFitDecTree, newdata=TestSet, type="class")
confMatDecTree <- confusionMatrix(predictDecTree, TestSet$classe)
confMatDecTree
```
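A simple explanation is that you did not tune either model, and with the default settings rpart happened to perform better by pure chance.
When you use the same parameters, you should expect the same performance.
Let's do some tuning with caret: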
```r
set.seed(1)
mod_rpart <- train(classe ~ .,
                   method = "rpart",
                   data = TrainSet,
                   tuneLength = 50,
                   metric = "Accuracy",
                   trControl = trainControl(method = "repeatedcv",
                                            number = 4,
                                            repeats = 5,
                                            summaryFunction = multiClassSummary,
                                            classProbs = TRUE))
pred_rpart <- predict(mod_rpart, TestSet)
confusionMatrix(pred_rpart, TestSet$classe)
```
Output:
```
Confusion Matrix and Statistics

          Reference
Prediction    A    B    C    D    E
         A 4359  243   92  135   38
         B  446 2489  299  161  276
         C  118  346 2477  300   92
         D  190  377  128 2240  368
         E  188  152  254  219 2652

Overall Statistics

               Accuracy : 0.7628
                 95% CI : (0.7566, 0.7688)
    No Information Rate : 0.2844
    P-Value [Acc > NIR] : < 2.2e-16

                  Kappa : 0.7009

 Mcnemar's Test P-Value : < 2.2e-16

Statistics by Class:

                     Class: A Class: B Class: C Class: D Class: E
Sensitivity            0.8223   0.6900   0.7622   0.7332   0.7741
Specificity            0.9619   0.9214   0.9444   0.9318   0.9466
Pos Pred Value         0.8956   0.6780   0.7432   0.6782   0.7654
Neg Pred Value         0.9316   0.9253   0.9495   0.9469   0.9490
Prevalence             0.2844   0.1935   0.1744   0.1639   0.1838
Detection Rate         0.2339   0.1335   0.1329   0.1202   0.1423
Detection Prevalence   0.2611   0.1970   0.1788   0.1772   0.1859
Balanced Accuracy      0.8921   0.8057   0.8533   0.8325   0.8603
```
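As a quick sanity check on the output above, the overall accuracy is the sum of the diagonal of the confusion matrix divided by the total number of test rows. The sketch below re-enters the printed counts by hand (the object names `cm`, `accuracy` and `nir` are only for illustration) and recomputes two of the reported figures in base R:

```r
# Counts copied from the confusion matrix printed above
# (rows = predicted class, columns = reference class).
classes <- c("A", "B", "C", "D", "E")
cm <- matrix(c(4359,  243,   92,  135,   38,
                446, 2489,  299,  161,  276,
                118,  346, 2477,  300,   92,
                190,  377,  128, 2240,  368,
                188,  152,  254,  219, 2652),
             nrow = 5, byrow = TRUE,
             dimnames = list(Prediction = classes, Reference = classes))

# Accuracy: correctly classified rows over all rows.
accuracy <- sum(diag(cm)) / sum(cm)

# No-information rate: share of the most common reference class.
nir <- max(colSums(cm)) / sum(cm)

round(c(accuracy = accuracy, nir = nir), 4)
```

Both values should agree with the `Accuracy` and `No Information Rate` lines in the printed statistics.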
This is quite a bit better than rpart with its default settings (cp = 0.01).
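To make the role of `cp` concrete, here is a minimal sketch on the built-in `iris` data (not the dataset from the question). `cp` is the minimum relative improvement a split must deliver to be kept, so a larger value yields a smaller tree. The helper `count_leaves` is a hypothetical name introduced only for this illustration.

```r
library(rpart)

# Hypothetical helper: count the terminal nodes of a fitted rpart tree.
count_leaves <- function(fit) sum(fit$frame$var == "<leaf>")

# rpart's default complexity parameter.
default_cp <- rpart.control()$cp

# Same data and formula, three different cp values.
fit_default <- rpart(Species ~ ., data = iris, method = "class")
fit_large   <- rpart(Species ~ ., data = iris, method = "class",
                     control = rpart.control(cp = 0.9))
fit_zero    <- rpart(Species ~ ., data = iris, method = "class",
                     control = rpart.control(cp = 0))

# Number of leaves per setting: a larger cp never gives a larger tree.
c(default = count_leaves(fit_default),
  large   = count_leaves(fit_large),
  zero    = count_leaves(fit_zero))
```

With a very large `cp` no split is worth keeping and the tree collapses to its root, which is the extreme case of the "few nodes" symptom described in the question.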
What happens if we give rpart the optimal cp chosen by caret?
```r
# bestTune is a one-row data frame, so extract the cp column as a number
modFitDecTree <- rpart(classe ~ .,
                       data = TrainSet,
                       method = "class",
                       control = rpart.control(cp = mod_rpart$bestTune$cp))
predictDecTree <- predict(modFitDecTree, newdata = TestSet, type = "class")
confusionMatrix(predictDecTree, TestSet$classe)
```
Part of the output:
```
Accuracy : 0.7628
```
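One plausible reason the untuned `train()` call in the question produced a tree with so few splits: by default `train()` evaluates only three candidate values (`tuneLength = 3`). As far as I understand caret's `rpart` method (this is an assumption on my part, not something stated above), those candidates are the largest `cp` values from the cost-complexity table of an unpruned tree. The sketch below imitates that selection with `rpart` alone on the built-in `iris` data; `candidate_cp` is an illustrative name.

```r
library(rpart)

# Grow an unpruned tree (cp = 0) and read its cost-complexity table.
full_tree <- rpart(Species ~ ., data = iris, method = "class",
                   control = rpart.control(cp = 0))
cp_table <- full_tree$cptable

# Assumption: a default-sized grid keeps only the first (largest) cp values.
# Rows of cptable are ordered from largest to smallest CP.
candidate_cp <- head(cp_table[, c("CP", "nsplit"), drop = FALSE], 3)
candidate_cp
```

The first rows of the table correspond to the smallest trees (the very first row is always the root with zero splits), so a search limited to them can only return a heavily pruned tree. Raising `tuneLength` lets the search reach smaller `cp` values and therefore larger trees.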