
Feature selection and prediction accuracy in regression Forest in R

I am attempting to solve a regression problem where the input feature set is of size ~54.

Using OLS linear regression with a single predictor 'X1', I am not able to explain the variation in Y; hence I am trying to find additional important features using a regression forest (i.e., random forest regression). The selected 'X1' is later found to be the most important feature.

My dataset has ~14500 entries. I have separated it into training and test sets in the ratio 9:1.
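
Roughly, the split looks like this (a minimal sketch; the data frame name d is a placeholder):

    set.seed(42)                              # for a reproducible split
    n   <- nrow(d)                            # ~14500 rows
    idx <- sample(n, size = round(0.9 * n))   # 90% of the row indices
    dTraining <- d[idx, ]                     # training set (~90%)
    dTest     <- d[-idx, ]                    # held-out test set (~10%)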

I have the following questions:

  1. When trying to find the important features, should I run the regression forest on the entire dataset, or only on the training data?

  2. Once the important features are found, should the model be re-built using the top few features to see whether feature selection speeds up the computation at a small cost to predictive power?

  3. For now, I have built the model using the training set and all the features, and I am using it for prediction on the test set. I am calculating the MSE and R-squared for the training and test sets. I am getting high MSE and low R2 on the training data, and the reverse on the test data (shown below). Is this unusual?

    forest <- randomForest(fmla, dTraining, ntree=501, importance=T)

    mean((dTraining$y - predict(forest, data=dTraining))^2)
    0.9371891

    rSquared(dTraining$y, dTraining$y - predict(forest, data=dTraining))
    0.7431078

    mean((dTest$y - predict(forest, newdata=dTest))^2)
    0.009771256

    rSquared(dTest$y, dTest$y - predict(forest, newdata=dTest))
    0.9950448

Please also suggest whether R-squared and MSE are good metrics for this problem, or whether I need to look at some other metrics to evaluate whether the model is good.

You should also try Cross Validated here.

When trying to find the important features, should I run the regression forest on the entire dataset, or only on the training data?

Only on the training data. You want to prevent overfitting, which is why you do a train-test split in the first place.
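
As a minimal sketch of that workflow (using the randomForest package; y ~ . stands in for your fmla):

    library(randomForest)

    # Fit the forest on the training data only, then inspect variable importance
    forest <- randomForest(y ~ ., data = dTraining, ntree = 501, importance = TRUE)

    imp <- importance(forest)                                 # %IncMSE and IncNodePurity per feature
    head(imp[order(imp[, "%IncMSE"], decreasing = TRUE), ])   # most important features first
    varImpPlot(forest)                                        # visual summary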

Once the important features are found, should the model be re-built using the top few features to see whether feature selection speeds up the computation at a small cost to predictive power?

Yes, but the purpose of feature selection is not necessarily to speed up computation. With infinite features, it is possible to fit any pattern of data (i.e., overfitting). With feature selection, you're hoping to prevent overfitting by using only a few 'robust' features.
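
A sketch of what that refit could look like, continuing from the forest above and using an arbitrary cutoff of the top 10 features:

    # Keep the top 10 features ranked by permutation importance (cutoff is arbitrary)
    imp <- importance(forest)
    top <- rownames(imp)[order(imp[, "%IncMSE"], decreasing = TRUE)][1:10]

    # Rebuild the forest with only those features and compare test-set MSE
    fmlaTop   <- reformulate(top, response = "y")
    forestTop <- randomForest(fmlaTop, data = dTraining, ntree = 501, importance = TRUE)

    mean((dTest$y - predict(forest,    newdata = dTest))^2)   # full model
    mean((dTest$y - predict(forestTop, newdata = dTest))^2)   # reduced model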

For now, I have built the model using the training set and all the features, and I am using it for prediction on the test set. I am calculating the MSE and R-squared for the training and test sets. I am getting high MSE and low R2 on the training data, and the reverse on the test data (shown below). Is this unusual?

Yes, it's unusual. You want low MSE and high R2 values for both your training and test data. (I would double check your calculations.) If you're getting high MSE and low R2 with your training data, it means your training was poor, which is very surprising. Also, I haven't used rSquared but maybe you want rSquared(dTest$y, predict(forest, newdata=dTest))?
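
One thing worth checking (a sketch, not a definitive diagnosis): as far as I know, predict.randomForest takes newdata, not data, so predict(forest, data=dTraining) ignores that argument and returns the out-of-bag predictions instead of in-sample predictions. Computing both sets of metrics with an explicit newdata= keeps them comparable:

    # Explicit newdata= for both sets; without it, predict() falls back to the
    # out-of-bag predictions stored in the forest object
    predTrain <- predict(forest, newdata = dTraining)
    predTest  <- predict(forest, newdata = dTest)

    # MSE and R-squared computed the same way for both sets
    mse <- function(y, yhat) mean((y - yhat)^2)
    r2  <- function(y, yhat) 1 - sum((y - yhat)^2) / sum((y - mean(y))^2)

    mse(dTraining$y, predTrain); r2(dTraining$y, predTrain)
    mse(dTest$y, predTest);      r2(dTest$y, predTest)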
