Variable selection and prediction accuracy in Random Forest

Question

I have a cross-sectional data set observed in two years, 2009 and 2010. I use the first year (2009) as the training set for a Random Forest regression and the second year (2010) as the test set.

Load the data

df <- read.csv("https://www.dropbox.com/s/t4iirnel5kqgv34/df.cv?dl=1")

After training the Random Forest on 2009, the variable importance indicates that x1 is the most important variable.

Random Forest using all variables

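library(randomForest)

set.seed(89)  # fix the RNG so the forest is reproducible
rf2009 <- randomForest(y ~ x1 + x2 + x3 + x4 + x5 + x6,
                       data = df[df$year == 2009, ],  # train on 2009 only
                       ntree = 500,
                       mtry = 6,  # all six predictors are tried at every split
                       importance = TRUE)  # needed for %IncMSE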
print(rf2009)

Call:
 randomForest(formula = y ~ x1 + x2 + x3 + x4 + x5 + x6, data = df[df$year == 2009, ], ntree = 500, mtry = 6, importance = TRUE) 
               Type of random forest: regression
                     Number of trees: 500
No. of variables tried at each split: 6

          Mean of squared residuals: 5208746
                    % Var explained: 75.59

Variable importance

imp.all <- as.data.frame(sort(importance(rf2009)[, 1], decreasing = TRUE), optional = TRUE)
names(imp.all) <- "% Inc MSE"
imp.all

   % Inc MSE
x1 35.857840
x2 16.693059
x3 15.745721
x4 15.105710
x5  9.002924
x6  6.160413

I then move on to the test set and obtain the following accuracy metrics.

Prediction and evaluation on the test set

test.pred.all <- predict(rf2009,df[df$year==2010,])
RMSE.forest.all <- sqrt(mean((test.pred.all-df[df$year==2010,]$y)^2))
RMSE.forest.all
[1] 2258.041

MAE.forest.all <- mean(abs(test.pred.all-df[df$year==2010,]$y))
MAE.forest.all
[1] 299.0751

When I then train the model without x1, the most important variable according to the above, and apply the trained model to the test set, I observe the following:

the variance explained is higher with x1 than without x1, as expected
but the test-set RMSE is better without x1 (2258.041 with x1 vs. 1885.462 without x1)
nevertheless, the MAE is slightly better with x1 (299.0751) than without it (302.3382).
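
RMSE squares each error before averaging, so a handful of very large misses can dominate it, whereas MAE weights every error linearly. The test-set RMSE here (about 2258) is several times the MAE (about 299), which is consistent with a few large errors driving the RMSE. A minimal sketch with made-up error vectors (not the question's data; `err_a` and `err_b` are hypothetical) shows how two models can rank differently under the two metrics:

```r
# Toy illustration with invented prediction errors, not the question's data.
rmse <- function(err) sqrt(mean(err^2))
mae  <- function(err) mean(abs(err))

# Hypothetical model A: small errors on 99 rows, one very large miss.
err_a <- c(rep(10, 99), 5000)
# Hypothetical model B: larger errors on 99 rows, but a smaller worst miss.
err_b <- c(rep(40, 99), 3000)

# A has the lower MAE, B has the lower RMSE.
rbind(A = c(RMSE = rmse(err_a), MAE = mae(err_a)),
      B = c(RMSE = rmse(err_b), MAE = mae(err_b)))
```

If the two metrics disagree in this way, inspecting the largest absolute errors on the test set for each model may show whether a few observations account for the RMSE difference.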

Random Forest excluding x1

rf2009nox1 <- randomForest(y ~ x2 + x3 + x4 + x5 + x6,
                       data = df[df$year==2009,], 
                       ntree=500,
                       mtry = 5,
                       importance = TRUE)
print(rf2009nox1)

Call:
 randomForest(formula = y ~ x2 + x3 + x4 + x5 + x6, data = df[df$year == 2009, ], ntree = 500, mtry = 5, importance = TRUE) 
               Type of random forest: regression
                     Number of trees: 500
No. of variables tried at each split: 5

          Mean of squared residuals: 6158161
                    % Var explained: 71.14

Variable importance

imp.nox1 <- as.data.frame(sort(importance(rf2009nox1)[, 1], decreasing = TRUE), optional = TRUE)
names(imp.nox1) <- "% Inc MSE"
imp.nox1

   % Inc MSE
x2 37.369704
x4 11.817910
x3 11.559375
x5  5.878555
x6  5.533794

Prediction and evaluation on the test set

test.pred.nox1 <- predict(rf2009nox1,df[df$year==2010,])
RMSE.forest.nox1 <- sqrt(mean((test.pred.nox1-df[df$year==2010,]$y)^2))
RMSE.forest.nox1
[1] 1885.462

MAE.forest.nox1 <- mean(abs(test.pred.nox1-df[df$year==2010,]$y))
MAE.forest.nox1
[1] 302.3382

I am aware that variable importance is computed on the training data and not on the test data, but does this result mean that x1 should not be included in the model?

So, should I include x1 in the model?

Answer 1

I think you need more information about the model's performance. With only one test sample, you can only speculate about why the RMSE is better without x1 even though x1 has the highest importance. It could be due to correlation between the variables, or to the model fitting noise in the training set.

To get more information, I would recommend looking at the out-of-bag error and doing hyperparameter optimization with cross-validation. If you see the same behavior after testing on different test data sets, you could run the cross-validation with and without x1.
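
The comparison with and without x1 could be sketched roughly as follows. This is only an illustration under stated assumptions: `cv_rmse` and its arguments (`k`, `seed`, `fit_fun`) are hypothetical names introduced here, not part of the randomForest package, and plain random folds are assumed to be appropriate for the 2009 data.

```r
# Hypothetical helper: k-fold cross-validated RMSE for a given formula.
# `fit_fun` is any fitting function with a (formula, data, ...) interface
# whose result works with predict(); randomForest is the intended default.
cv_rmse <- function(formula, data, k = 5, seed = 89,
                    fit_fun = randomForest::randomForest, ...) {
  set.seed(seed)
  # Randomly assign every row to one of k folds of (nearly) equal size.
  fold <- sample(rep(seq_len(k), length.out = nrow(data)))
  response <- all.vars(formula)[1]  # name of the left-hand-side variable
  sapply(seq_len(k), function(i) {
    # Fit on all folds except i, then score on the held-out fold i.
    fit  <- fit_fun(formula, data = data[fold != i, , drop = FALSE], ...)
    held <- data[fold == i, , drop = FALSE]
    pred <- predict(fit, newdata = held)
    sqrt(mean((pred - held[[response]])^2))
  })
}

# Usage on the question's data (assumes `df` is loaded and the
# randomForest package is installed):
# train   <- df[df$year == 2009, ]
# cv.all  <- cv_rmse(y ~ x1 + x2 + x3 + x4 + x5 + x6, train, ntree = 500)
# cv.nox1 <- cv_rmse(y ~ x2 + x3 + x4 + x5 + x6, train, ntree = 500)
# summary(cv.all - cv.nox1)  # same seed, so the folds are paired
```

Because both calls use the same seed and the same rows, each fold's RMSE with x1 can be compared directly with the same fold's RMSE without it. The "Mean of squared residuals" that print() already reports for each forest is the out-of-bag MSE (5208746 with x1 vs. 6158161 without), which is one such additional estimate.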

Hope it's helpful.

Answer 1 (accepted 2020-05-01 15:47:36)