
Overcoming multicollinearity in Random Forest regression while still keeping all variables in the model

I am new to Random Forest regression. I have 300 continuous variables (299 predictors and 1 target) in prep1, and some of the predictors are highly correlated. The problem is that I still need an importance value for every one of the predictors, so eliminating some of them is not an option.

Here are my questions:

1) Is there a way to choose, for each tree, only variables that are not highly correlated? If yes, how should the code below be adjusted?

2) Assuming the answer to 1) is yes, will this take care of the multicollinearity problem?

  library(randomForest)

  bound <- floor(nrow(prep1) / 2)                # size of the training set (50/50 split)
  df <- prep1[sample(nrow(prep1)), ]             # shuffle the rows
  train <- df[1:bound, ]
  test <- df[(bound + 1):nrow(df), ]
  modelFit <- randomForest(continuous_target ~ ., data = train)
  prediction <- predict(modelFit, test)

Random Forest by nature selects samples with replacement and, on each of those samples, randomly selects a subset of the features. In your scenario, given that the response variable is not skewed, building a LARGE number of trees should give you an importance value for every variable. This does increase the computational cost, because across the bags the importance of the same variable is captured many times over. Also, multicollinearity won't affect the predictive power.
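For illustration, here is a minimal sketch of that suggestion in R, reusing the train data frame and continuous_target column from the question. The ntree value of 2000 is just an arbitrary "large" choice, not a recommendation from the answer:

  library(randomForest)

  modelFit <- randomForest(continuous_target ~ ., data = train,
                           ntree = 2000,       # many trees so every predictor is sampled often
                           importance = TRUE)  # compute permutation importance as well as node impurity

  imp <- importance(modelFit)                          # matrix with %IncMSE and IncNodePurity for all predictors
  imp[order(imp[, "%IncMSE"], decreasing = TRUE), ]    # predictors ranked by permutation importance
  varImpPlot(modelFit)                                 # quick visual check

With importance = TRUE you get a value for every one of the 299 predictors, including the correlated ones, which addresses the requirement of keeping all variables in the model.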
