
Feature selection and prediction accuracy in regression Forest in R

Original post: 2017-08-29 09:51:00 | Tags: r, regression, random-forest, feature-selection

I am attempting to solve a regression problem where the input feature set is of size ~54.

Using OLS linear regression with a single predictor 'X1', I am not able to explain the variation in Y, so I am trying to find additional important features using a regression forest (i.e., random forest regression). The predictor I selected, 'X1', is later found to be the most important feature.

My dataset has ~14500 entries. I have separated it into training and test sets in the ratio 9:1.

I have the following questions:

1. When trying to find the important features, should I run the regression forest on the entire dataset, or only on the training data?
2. Once the important features are found, should the model be re-built using the top few features to see whether feature selection speeds up the computation at a small cost to predictive power?
3. For now, I have built the model using the training set and all the features, and I am using it for prediction on the test set. I am calculating the MSE and R-squared from the training set. I am getting high MSE and low R2 on the training data, and the reverse on the test data (shown below). Is this unusual?

forest <- randomForest(fmla, dTraining, ntree=501, importance=T)

mean((dTraining$y - predict(forest, data=dTraining))^2)

0.9371891

rSquared(dTraining$y, dTraining$y - predict(forest, data=dTraining))

0.7431078

mean((dTest$y - predict(forest, newdata=dTest))^2)

0.009771256

rSquared(dTest$y, dTest$y - predict(forest, newdata=dTest))

0.9950448

Please advise. Are R-squared and MSE good metrics for this problem, or do I need to look at other metrics to evaluate whether the model is good?

1 answer

You could also try asking this on Cross Validated.

when trying to find the important features, should I run the regression forest on the entire dataset, or only on the training data?

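Only on the training data. If the test rows take part in choosing the features, information about them leaks into the model and the test error stops being an honest estimate of performance on unseen data. Guarding against that kind of overfitting is why you do a train-test split in the first place.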
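As a minimal sketch of ranking features on the training rows only: the data here are synthetic (10 predictors, of which only X1 and X2 drive y), and the names d, test_idx, imp and ranked are made up for illustration. It assumes the randomForest package is installed and uses permutation importance (%IncMSE), which is one reasonable choice rather than the only one.

```r
library(randomForest)

set.seed(42)

# Synthetic stand-in for the real data: 10 predictors, only X1 and X2 matter.
n <- 1000
d <- as.data.frame(matrix(rnorm(n * 10), ncol = 10))
names(d) <- paste0("X", 1:10)
d$y <- 3 * d$X1 + 1.5 * d$X2 + rnorm(n, sd = 0.5)

# 9:1 split, as in the question.
test_idx  <- sample(n, size = n / 10)
dTraining <- d[-test_idx, ]
dTest     <- d[test_idx, ]

# Fit and rank features on the training rows only; dTest is not used here.
forest <- randomForest(y ~ ., data = dTraining, ntree = 501, importance = TRUE)

# type = 1 requests permutation importance (%IncMSE); needs importance = TRUE.
imp    <- importance(forest, type = 1)
ranked <- rownames(imp)[order(imp[, 1], decreasing = TRUE)]
print(head(ranked, 5))
```

The test set stays untouched until the final evaluation, so whatever ranking comes out cannot have been tuned to it.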

Once the important features are found, should the model be re-built using the top few features to see whether feature selection speeds up the computation at a small cost to predictive power?

Yes, but the purpose of feature selection is not necessarily to speed up computation. With enough features, it is possible to fit any pattern in the data, including noise (i.e., overfitting). With feature selection, you're hoping to prevent overfitting by using only a few 'robust' features.
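One way to check whether a reduced model holds up is to refit on the top-ranked features and compare test error with the full model. The sketch uses synthetic data and an arbitrary cutoff of three features; top_k, full, reduced and the mse helper are illustrative names, and the randomForest package is assumed to be installed.

```r
library(randomForest)

set.seed(1)

# Synthetic data: 10 predictors, only X1 and X2 carry signal.
n <- 1000
d <- as.data.frame(matrix(rnorm(n * 10), ncol = 10))
names(d) <- paste0("X", 1:10)
d$y <- 3 * d$X1 + 1.5 * d$X2 + rnorm(n, sd = 0.5)

test_idx  <- sample(n, size = n / 10)
dTraining <- d[-test_idx, ]
dTest     <- d[test_idx, ]

mse <- function(actual, predicted) mean((actual - predicted)^2)

# Full model, with importance computed on the training rows only.
full  <- randomForest(y ~ ., data = dTraining, ntree = 501, importance = TRUE)
imp   <- importance(full, type = 1)
top_k <- rownames(imp)[order(imp[, 1], decreasing = TRUE)][1:3]  # cutoff of 3 is arbitrary

# Reduced model: same training rows, top-ranked features only.
reduced <- randomForest(reformulate(top_k, response = "y"),
                        data = dTraining, ntree = 501)

mse_full    <- mse(dTest$y, predict(full,    newdata = dTest))
mse_reduced <- mse(dTest$y, predict(reduced, newdata = dTest))
print(c(full = mse_full, reduced = mse_reduced))
```

If the number of features to keep is itself being tuned, cross-validation within the training set is the safer place to do it, so that the test set is consulted only once.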

For now, I have built the model using the training set and all the features, and I am using it for prediction on the test set. I am calculating the MSE and R-squared from the training set. I am getting high MSE and low R2 on the training data, and reverse on the test data (shown below). Is this unusual?

Yes, it's unusual. You want low MSE and high R2 on both the training and the test data, so I would double-check your calculations. High MSE and low R2 on the training data would mean the model fits its own training data poorly, which is very surprising. Also, I haven't used rSquared, but maybe you want rSquared(dTest$y, predict(forest, newdata=dTest))?
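One thing worth checking in the question's code: predict() for a randomForest object takes newdata, not data. An argument named data is silently ignored, and without newdata the method returns the out-of-bag (OOB) predictions, so the "training" figures in the question are probably OOB estimates rather than in-sample fits. The sketch illustrates this on synthetic data. The helpers mse and r_squared are written out here so the arguments are unambiguous. If the question's rSquared comes from the miscTools package (an assumption, since the post does not say), its second argument is the residual vector, in which case the question's calls already have the right shape.

```r
library(randomForest)

set.seed(7)

# Synthetic data with two informative predictors.
n <- 1000
d <- data.frame(X1 = rnorm(n), X2 = rnorm(n))
d$y <- 3 * d$X1 + 1.5 * d$X2 + rnorm(n, sd = 0.5)

test_idx  <- sample(n, size = n / 10)
dTraining <- d[-test_idx, ]
dTest     <- d[test_idx, ]

# Both helpers take (actual, predicted), not residuals.
mse <- function(actual, predicted) mean((actual - predicted)^2)
r_squared <- function(actual, predicted) {
  1 - sum((actual - predicted)^2) / sum((actual - mean(actual))^2)
}

forest <- randomForest(y ~ ., data = dTraining, ntree = 501)

oob_pred  <- predict(forest, data = dTraining)     # 'data' is ignored: OOB predictions
in_pred   <- predict(forest, newdata = dTraining)  # true in-sample predictions
test_pred <- predict(forest, newdata = dTest)

# The 'data =' call gives the same values the forest stores as OOB predictions.
print(isTRUE(all.equal(oob_pred, forest$predicted)))

print(c(oob_mse   = mse(dTraining$y, oob_pred),
        train_mse = mse(dTraining$y, in_pred),
        test_mse  = mse(dTest$y, test_pred)))
print(c(oob_r2    = r_squared(dTraining$y, oob_pred),
        test_r2   = r_squared(dTest$y, test_pred)))
```

OOB error is usually close to test error, so a test MSE that is far lower than the OOB MSE, as in the question, suggests looking at how the split was made (for example, whether the test rows differ systematically from the training rows).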



Answer posted 2017-08-29 10:59:19.
