
The train function in R caret package

Suppose I have a data set and I want to do 4-fold cross-validation using logistic regression, so there will be 4 different models. In R, I did the following:

ctrl <- trainControl(method = "repeatedcv", number = 4, savePredictions = TRUE)
mod_fit <- train(outcome ~., data=data1, method = "glm", family="binomial", trControl = ctrl)

I would assume that mod_fit should contain 4 separate sets of coefficients. But when I type mod_fit$finalModel, I only get a single set of coefficients.

I've created a reproducible example based on your code snippet. The first thing to notice about your code is that it specifies repeatedcv as the method but doesn't set repeats, so the number = 4 parameter just tells it to resample 4 times, i.e. a single pass of 4-fold CV (this is not an answer to your question, but it is important to understand).
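To make the role of repeats concrete, here is a minimal sketch (not part of the original answer) that contrasts plain 4-fold CV with 4-fold CV repeated 3 times. The value repeats = 3 is an arbitrary choice for illustration, and createMultiFolds is used here only to count the resamples that repeated CV would generate.

```r
library(caret)

data(iris)

# Plain 4-fold CV: one pass over 4 folds.
ctrl_cv <- trainControl(method = "cv", number = 4)

# Repeated 4-fold CV: the 4-fold split is redrawn `repeats` times.
# repeats = 3 is an illustrative value, not taken from the question.
ctrl_rcv <- trainControl(method = "repeatedcv", number = 4, repeats = 3)

# createMultiFolds() builds the training-row indices for repeated k-fold CV.
set.seed(1)
folds <- createMultiFolds(iris$Species, k = 4, times = 3)

length(folds)       # 4 folds x 3 repeats = 12 resamples
head(names(folds))  # names follow the FoldX.RepY pattern
```

With repeats left out, only the Rep1 resamples are produced, which matches the Fold1.Rep1 to Fold4.Rep1 labels in the output of the original example.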

mod_fit$finalModel gives you only 1 set of coefficients because it is the one final model, refit on the full data set once resampling is done. The 4 folds of the (non-repeated) k-fold CV are used only to estimate performance: their results are aggregated into the performance summary, not into the coefficients.
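One way to check this is to compare the final model with an ordinary glm fit on all rows. The sketch below is an illustration rather than part of the original answer; it uses method = "cv" and a fixed seed as assumptions. Because setosa is perfectly separable in iris, glm emits convergence warnings and the coefficient values themselves are not meaningful; only the comparison matters.

```r
library(caret)

data(iris)
iris$binary  <- ifelse(iris$Species == "setosa", 1, 0)
iris$Species <- NULL

set.seed(1)
ctrl    <- trainControl(method = "cv", number = 4)
mod_fit <- train(binary ~ ., data = iris, method = "glm",
                 family = "binomial", trControl = ctrl)

# Ordinary glm on all 150 rows, with no resampling involved.
full_fit <- glm(binary ~ ., data = iris, family = binomial)

# Compare the two coefficient vectors (names dropped, values only).
same_coefs <- all.equal(unname(coef(mod_fit$finalModel)),
                        unname(coef(full_fit)))
same_coefs
```

If the two vectors agree, the folds contributed nothing to the coefficients, which is why only one set is stored.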

You can see the fold-level performance in the resample object:

library(caret)

data(iris)  # iris ships with base R, so no extra data package is needed

# Binary outcome: 1 for setosa, 0 otherwise.
# It is kept numeric, so caret reports RMSE and Rsquared.
iris$binary  <- ifelse(iris$Species == "setosa", 1, 0)
iris$Species <- NULL

ctrl    <- trainControl(method = "repeatedcv",
                        number = 4,
                        savePredictions = TRUE,
                        verboseIter = TRUE,
                        returnResamp = "all")

mod_fit <- train(binary ~ .,
                 data = iris,
                 method = "glm",
                 family = "binomial",
                 trControl = ctrl)


# Fold-level performance
mod_fit$resample
          RMSE  Rsquared parameter   Resample
1 2.630866e-03 0.9999658      none Fold1.Rep1
2 3.863821e-08 1.0000000      none Fold2.Rep1
3 8.162472e-12 1.0000000      none Fold3.Rep1
4 2.559189e-13 1.0000000      none Fold4.Rep1

To your earlier point, the package does not save or display the coefficients of each fold. In addition to the performance information above, however, it does save the index (list of in-sample rows), indexOut (hold-out rows), and random seeds for each fold, so if you were so inclined it would be easy to reconstruct the intermediate models.

mod_fit$control$seeds
[[1]]
[1] 169815

[[2]]
[1] 445763

[[3]]
[1] 871613

[[4]]
[1] 706905

[[5]]
[1] 89408
mod_fit$control$index
$Fold1
 [1] 1 2 3 4 5 6 7 8 9 10 11 12 15 18 19 21 22 24 28 30 31 32 33 34 35 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 59 60 61 63
[45] 64 65 66 68 69 70 71 72 73 75 76 77 79 80 81 82 84 85 86 87 89 90 91 92 93 94 95 96 98 99 100 103 104 106 107 108 110 111 113 114 116 118 119 120
[89] 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 140 141 142 143 145 147 149 150

$Fold2
 [1] 1 6 7 8 12 13 14 15 16 17 18 19 20 21 22 23 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 42 44 46 48 50 51 53 54 55 56 57 58
[45] 59 61 62 64 66 67 69 70 71 72 73 74 75 76 78 79 80 81 82 83 84 85 87 88 89 90 91 92 95 96 97 98 99 101 102 104 105 106 108 109 111 112 113 115
[89] 116 117 119 120 121 122 123 127 130 131 132 134 135 137 138 139 140 141 142 143 144 145 146 147 148

$Fold3
 [1] 2 3 4 5 6 7 8 9 10 11 13 14 16 17 20 23 24 25 26 27 28 29 30 33 35 36 37 38 39 40 41 43 45 46 47 49 50 51 52 54 55 56 57 58
[45] 60 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 82 83 84 85 86 88 89 93 94 97 98 99 100 101 102 103 105 106 107 108 109 110 111 112 114 115
[89] 117 118 119 121 124 125 126 128 129 131 132 133 134 135 136 137 138 139 144 145 146 147 148 149 150

$Fold4
 [1] 1 2 3 4 5 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 29 31 32 34 36 37 38 39 41 42 43 44 45 47 48 49 52 53 55 56
[45] 57 58 59 60 61 62 63 65 67 68 74 77 78 79 80 81 83 86 87 88 90 91 92 93 94 95 96 97 100 101 102 103 104 105 107 109 110 112 113 114 115 116 117 118
[89] 120 122 123 124 125 126 127 128 129 130 133 136 137 138 139 140 141 142 143 144 146 148 149 150

mod_fit$control$indexOut
$Resample1
 [1] 13 14 16 17 20 23 25 26 27 29 36 37 38 39 55 56 57 58 62 67 74 78 83 88 97 101 102 105 109 112 115 117 137 138 139 144 146 148

$Resample2
 [1] 2 3 4 5 9 10 11 24 41 43 45 47 49 52 60 63 65 68 77 86 93 94 100 103 107 110 114 118 124 125 126 128 129 133 136 149 150

$Resample3
 [1] 1 12 15 18 19 21 22 31 32 34 42 44 48 53 59 61 79 80 81 87 90 91 92 95 96 104 113 116 120 122 123 127 130 140 141 142 143

$Resample4
 [1] 6 7 8 28 30 33 35 40 46 50 51 54 64 66 69 70 71 72 73 75 76 82 84 85 89 98 99 106 108 111 119 121 131 132 134 135 145 147
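The reconstruction of the intermediate models mentioned above could look roughly like the following. This is a sketch under assumptions, not code from the original answer: it refits a plain glm on each fold's in-sample rows taken from mod_fit$control$index, and the names fold_fits, fold_coefs and held_out are made up for illustration. As before, setosa is perfectly separable, so expect glm warnings and very large coefficients.

```r
library(caret)

data(iris)
iris$binary  <- ifelse(iris$Species == "setosa", 1, 0)
iris$Species <- NULL

set.seed(1)
ctrl    <- trainControl(method = "cv", number = 4)
mod_fit <- train(binary ~ ., data = iris, method = "glm",
                 family = "binomial", trControl = ctrl)

# One glm per fold, each fit only on that fold's in-sample rows.
fold_fits <- lapply(mod_fit$control$index, function(rows) {
  glm(binary ~ ., data = iris[rows, ], family = binomial)
})

# Matrix with one column of coefficients per fold.
fold_coefs <- sapply(fold_fits, coef)
fold_coefs

# Predictions of the first fold's model on its own hold-out rows.
held_out <- mod_fit$control$indexOut[[1]]
predict(fold_fits[[1]], newdata = iris[held_out, ], type = "response")
```

This gives the 4 sets of coefficients the question was looking for, one per fold, alongside the single set stored in mod_fit$finalModel.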

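@Damien your mod_fit will not contain 4 separate sets of coefficients. You are asking for cross-validation with 4 folds. This does not mean you will have 4 different models. According to the documentation, the train function loops over each candidate set of parameter values and each resampling iteration: it holds out some samples, fits the model on the remainder, and predicts the held-out samples. It then averages the held-out performance for each parameter set, picks the optimal set, and fits the final model to all the training data using that set.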

At the end of the resampling loop (in your case, 4 iterations for 4 folds), you will have one set of average forecast accuracy measures (e.g., RMSE, R-squared) for a given set of model parameters.
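The averaging step can be checked directly. The following sketch is illustrative only (it assumes method = "cv" and a fixed seed) and compares the mean of the per-fold RMSE values with the summary that train reports in mod_fit$results.

```r
library(caret)

data(iris)
iris$binary  <- ifelse(iris$Species == "setosa", 1, 0)
iris$Species <- NULL

set.seed(1)
ctrl    <- trainControl(method = "cv", number = 4)
mod_fit <- train(binary ~ ., data = iris, method = "glm",
                 family = "binomial", trControl = ctrl)

mod_fit$resample$RMSE   # one hold-out RMSE per fold

# Average over the 4 folds, next to the summary row reported by train().
fold_mean <- mean(mod_fit$resample$RMSE)
fold_mean
mod_fit$results$RMSE
```

The two numbers should agree, since the summary row is the average of the fold-level results.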

Since you did not pass the tuneGrid or tuneLength argument to train, by default train will tune over three values of each tunable parameter.

This means you will have at most three candidate models (not 4 models as you were expecting), and therefore at most three sets of average model performance measures. Note that glm itself has no tuning parameters, so in this case there is only one candidate.

The optimal model is the one with the lowest RMSE in the case of regression. The coefficients of this model are available in mod_fit$finalModel.
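As a rough illustration of the default tuning behaviour, the sketch below fits a model that does have a tunable parameter (k-nearest neighbours, chosen here only as an example) next to glm, on a regression target picked for illustration. Neither the knn model nor the Sepal.Length target comes from the original question.

```r
library(caret)

data(iris)

set.seed(1)
ctrl <- trainControl(method = "cv", number = 4)

# knn has one tunable parameter (k). With no tuneGrid or tuneLength given,
# train() evaluates three candidate values of k.
knn_fit <- train(Sepal.Length ~ ., data = iris, method = "knn",
                 trControl = ctrl)
knn_fit$results    # one row of averaged performance per candidate k
knn_fit$bestTune   # the candidate selected for the final model

# glm has nothing to tune, so there is a single row of results.
glm_fit <- train(Sepal.Length ~ ., data = iris, method = "glm",
                 trControl = ctrl)
nrow(glm_fit$results)

# The selected k is the one with the lowest cross-validated RMSE.
best_by_rmse <- knn_fit$results$k[which.min(knn_fit$results$RMSE)]
best_by_rmse
```

In both cases finalModel holds a single model: the best candidate, refit on all the training rows.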

Notice: the technical posts on this site are licensed under CC BY-SA 4.0. If you republish them, please credit this site's URL or the original address.