Python: 3.6
Windows: 10
I have a few questions regarding Random Forest and the problem at hand:
I am using a randomized search to tune a Random Forest on a regression problem. I want to plot the tree corresponding to the best-fit parameters that the search found. Here is the code:
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV, train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=55)

# Use the random grid to search for the best hyperparameters.
# First create the base model to tune.
rf = RandomForestRegressor()

# Randomized search over the parameters, using 5-fold cross-validation,
# trying 50 different combinations, and using all available cores.
rf_random = RandomizedSearchCV(estimator=rf, param_distributions=random_grid,
                               n_iter=50, cv=5, verbose=2, random_state=56,
                               n_jobs=-1)

# Fit the randomized search model
rf_random.fit(X_train, y_train)
rf_random.best_params_
The best parameters came out to be:
{'n_estimators': 1000,
'min_samples_split': 5,
'min_samples_leaf': 1,
'max_features': 'auto',
'max_depth': 5,
'bootstrap': True}
How can I plot this tree using the above parameters?
My dependent variable y lies in the range [0, 1] (continuous), and all predictor variables are either binary or categorical. Which algorithms generally work well for this kind of input and output feature space? I tried Random Forest, but it didn't give very good results. Note that y here is a kind of ratio, which is why it lies between 0 and 1. Example: Expense on food / Total Expense.
The data is skewed: the dependent variable y has the value 1 in 60% of the data, and lies somewhere between 0 and 1 (e.g. 0.66, 0.87, and so on) in the rest.
Since my data has only binary {0, 1} and categorical {A, B, C} variables, do I need to convert it into one-hot encoded variables to use a random forest?
Regarding the plot (I am afraid that your other questions are way too broad for SO, where the general idea is to avoid asking multiple questions at the same time):
Fitting your RandomizedSearchCV has resulted in rf_random.best_estimator_, which is itself a random forest with the parameters shown in your question (including 'n_estimators': 1000).
According to the docs, a fitted RandomForestRegressor includes the attribute:
estimators_ : list of DecisionTreeRegressor
The collection of fitted sub-estimators.
So, to plot any individual tree of your Random Forest, you should use either
from sklearn import tree
tree.plot_tree(rf_random.best_estimator_.estimators_[k])
or
from sklearn import tree
tree.export_graphviz(rf_random.best_estimator_.estimators_[k])
for the desired k in [0, 999] in your case ([0, n_estimators - 1] in the general case).
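Putting the pieces together, here is a minimal self-contained sketch (with made-up toy data, since the original X and y are not shown) that fits a small forest and exports one of its individual trees with export_graphviz; plot_tree works the same way on rf.estimators_[k]:

```python
# Minimal sketch on hypothetical toy data: fit a small forest,
# then export one individual tree from estimators_.
from sklearn.ensemble import RandomForestRegressor
from sklearn.tree import export_graphviz

X = [[0, 1], [1, 0], [1, 1], [0, 0]]
y = [0.2, 0.8, 1.0, 0.1]

rf = RandomForestRegressor(n_estimators=10, max_depth=5, random_state=0)
rf.fit(X, y)

# Any k in [0, n_estimators - 1] selects one fitted sub-tree.
dot_source = export_graphviz(rf.estimators_[0], out_file=None,
                             feature_names=["x0", "x1"])
print(dot_source[:13])  # the Graphviz source starts with "digraph Tree {"
```

With your fitted search object, the estimator to index into would be rf_random.best_estimator_.estimators_ instead of rf.estimators_.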
Allow me to take a step back before answering your questions.
Ideally, one should drill down further on the best_params_ output of RandomizedSearchCV through GridSearchCV. RandomizedSearchCV goes over your parameters without trying out all the possible combinations. Then, once you have the best_params_ from RandomizedSearchCV, you can investigate all the possible options across a narrower range.
You did not include the random_grid parameters in your code, but I would expect you to do a GridSearchCV like this:
# Create the parameter grid based on the results of RandomizedSearchCV
param_grid = {
    'max_depth': [4, 5, 6],
    'min_samples_leaf': [1, 2],
    'min_samples_split': [4, 5, 6],
    'n_estimators': [990, 1000, 1010]
}
# Create and fit the grid search model
# (note: GridSearchCV has no random_state parameter)
grid_search = GridSearchCV(estimator=rf, param_grid=param_grid,
                           cv=5, n_jobs=-1, verbose=2)
grid_search.fit(X_train, y_train)
The above goes through all the possible combinations of parameters in param_grid and gives you the best ones.
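A runnable end-to-end sketch of this refinement step (on hypothetical toy data with deliberately small n_estimators values to keep it fast; the real X_train/y_train come from the question's train_test_split):

```python
# Sketch of the GridSearchCV refinement step on toy binary predictors.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

rng = np.random.RandomState(0)
X_train = rng.randint(0, 2, size=(40, 3))  # toy binary features
y_train = rng.rand(40)                      # toy continuous target in [0, 1]

param_grid = {
    'max_depth': [4, 5],
    'min_samples_split': [4, 5],
    'n_estimators': [10, 20],  # small values just for this toy example
}
grid_search = GridSearchCV(estimator=RandomForestRegressor(random_state=56),
                           param_grid=param_grid, cv=3, n_jobs=-1)
grid_search.fit(X_train, y_train)
best = grid_search.best_params_  # one combination from the grid above
```

After fitting, grid_search.best_estimator_ is the refit forest, just as with RandomizedSearchCV.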
Now coming to your questions:
Random forests are a combination of multiple trees, so you do not have just one tree that you can plot. What you can do instead is plot one or more of the individual trees used by the random forest. This can be achieved with the plot_tree function. Have a read of the documentation and this SO question to understand it better.
Did you try a simple linear regression first?
The skew would also affect what kind of accuracy metrics you use to assess your model's fit. Precision, recall, and F1 scores come to mind when dealing with unbalanced/skewed data (though note those apply to classification targets).
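Since y here is continuous, the direct regression analogues are error metrics such as MAE and RMSE. A minimal sketch with made-up predictions, just to show the calls:

```python
# Hypothetical true/predicted values for a ratio target in [0, 1].
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

y_true = np.array([1.0, 0.66, 0.87, 1.0])
y_pred = np.array([0.95, 0.70, 0.80, 1.0])

mae = mean_absolute_error(y_true, y_pred)          # mean absolute error
rmse = np.sqrt(mean_squared_error(y_true, y_pred)) # root mean squared error
print(mae)  # 0.04
```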
Yes, categorical variables need to be converted to dummy (one-hot) variables before fitting a scikit-learn random forest.
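For example, with pandas get_dummies (hypothetical column names; binary columns can stay as they are, only the categorical ones need encoding):

```python
# One-hot encode a categorical column; binary columns pass through unchanged.
import pandas as pd

df = pd.DataFrame({'binary_var': [0, 1, 1],
                   'cat_var': ['A', 'B', 'C']})
encoded = pd.get_dummies(df, columns=['cat_var'])
# 'cat_var' becomes cat_var_A, cat_var_B, cat_var_C; 'binary_var' is untouched.
print(list(encoded.columns))
```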