
How to plot the random forest tree corresponding to best parameter

Python: 3.6

Windows: 10

I have a few questions regarding Random Forest and the problem at hand:

I am using grid search (RandomizedSearchCV) to tune a Random Forest on a regression problem. I want to plot the tree corresponding to the best-fit parameters that the search found. Here is the code:

    from sklearn.ensemble import RandomForestRegressor
    from sklearn.model_selection import RandomizedSearchCV, train_test_split

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=55)

    # Use the random grid to search for the best hyperparameters
    # First create the base model to tune
    rf = RandomForestRegressor()
    # Random search of parameters, using 5-fold cross validation,
    # search across 50 different combinations, and use all available cores
    # (random_grid is defined elsewhere and not shown here)
    rf_random = RandomizedSearchCV(estimator=rf, param_distributions=random_grid, n_iter=50, cv=5, verbose=2, random_state=56, n_jobs=-1)
    # Fit the random search model
    rf_random.fit(X_train, y_train)

    rf_random.best_params_

The best parameters came out to be:

    {'n_estimators': 1000,
     'min_samples_split': 5,
     'min_samples_leaf': 1,
     'max_features': 'auto',
     'max_depth': 5,
     'bootstrap': True}
  1. How can I plot this tree using the above parameters?

  2. My dependent variable y lies in the range [0,1] (continuous) and all predictor variables are either binary or categorical. Which algorithms generally work well for this kind of input and output feature space? I tried Random Forest, but it didn't give very good results. Note that the y variable is a kind of ratio, which is why it lies between 0 and 1 (example: expense on food / total expense).

  3. The above data is skewed: the dependent (y) variable has a value of 1 in about 60% of the data and lies somewhere between 0 and 1 in the rest (e.g. 0.66, 0.87, and so on).

  4. Since my data has only binary {0,1} and categorical {A,B,C} variables, do I need to convert it into one-hot encoded variables before using Random Forest?

Regarding the plot (I am afraid that your other questions are way too broad for SO, where the general idea is to avoid asking multiple questions at the same time):

Fitting your RandomizedSearchCV has resulted in rf_random.best_estimator_, which is itself a random forest with the parameters shown in your question (including 'n_estimators': 1000).

According to the docs, a fitted RandomForestRegressor includes an attribute:

estimators_: list of DecisionTreeRegressor

The collection of fitted sub-estimators.

So, to plot any individual tree of your Random Forest, you should use either

    from sklearn import tree
    tree.plot_tree(rf_random.best_estimator_.estimators_[k])

or

    from sklearn import tree
    tree.export_graphviz(rf_random.best_estimator_.estimators_[k])

for the desired k in [0, 999] in your case ([0, n_estimators-1] in the general case).
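For example, here is a minimal sketch of rendering a single tree with matplotlib, assuming rf_random has already been fitted as in the question; the feature_names argument is optional and assumes X is a pandas DataFrame:

    import matplotlib.pyplot as plt
    from sklearn import tree

    # The fitted forest holds one DecisionTreeRegressor per estimator
    best_rf = rf_random.best_estimator_
    print(len(best_rf.estimators_))  # equals n_estimators, e.g. 1000

    # Plot the first tree of the forest
    fig, ax = plt.subplots(figsize=(20, 10))
    tree.plot_tree(best_rf.estimators_[0],
                   feature_names=list(X.columns),  # assumes X is a DataFrame
                   filled=True,
                   max_depth=3,  # truncate the drawing for readability
                   ax=ax)
    plt.show()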

Allow me to take a step back before answering your questions.

Ideally one should drill down further on the best_params_ output of RandomizedSearchCV by following up with GridSearchCV. RandomizedSearchCV goes over your parameter distributions without trying out all the possible combinations. Then, once you have the best_params_ from RandomizedSearchCV, you can investigate all the possible options across a narrower range.

You did not include the random_grid parameters in your code, but I would expect you to do a GridSearchCV like this:

    from sklearn.model_selection import GridSearchCV

    # Create the parameter grid based on the results of RandomizedSearchCV
    param_grid = {
        'max_depth': [4, 5, 6],
        'min_samples_leaf': [1, 2],
        'min_samples_split': [4, 5, 6],
        'n_estimators': [990, 1000, 1010]
    }
    # Create and fit the grid search model
    # (GridSearchCV has no random_state parameter, so it is dropped here)
    grid_search = GridSearchCV(estimator=rf, param_grid=param_grid,
                               cv=5, n_jobs=-1, verbose=2)
    grid_search.fit(X_train, y_train)

The above will go through all the possible combinations of parameters in param_grid and give you the best parameters.
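Once the grid search has finished, you can inspect and evaluate the refined model; a short sketch, assuming the train/test split from the question:

    # Best parameters found by the exhaustive search
    print(grid_search.best_params_)

    # Evaluate the refined forest on the held-out test set (R^2 by default)
    best_model = grid_search.best_estimator_
    print(best_model.score(X_test, y_test))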

Now coming to your questions:

  1. Random forests are a combination of multiple trees, so there is not a single tree that you can plot. What you can do instead is plot one or more of the individual trees used by the random forest. This can be achieved with the plot_tree function. Have a read of the documentation and this SO question to understand it more.

  2. Did you try a simple linear regression first?

  3. This would impact the kind of accuracy metrics you use to assess your model's fit. Precision, recall, and F1 scores come to mind when dealing with unbalanced/skewed data.

  4. Yes, categorical variables need to be converted to dummy variables before fitting a random forest in scikit-learn, as shown in the sketch below.
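As an illustration, here is a minimal sketch of one-hot encoding with pandas.get_dummies; the column names is_urban and region are made up for the example:

    import pandas as pd

    # Hypothetical predictors: one binary and one categorical column
    X = pd.DataFrame({'is_urban': [0, 1, 1, 0],
                      'region': ['A', 'B', 'C', 'A']})

    # get_dummies leaves the numeric binary column as-is and
    # expands the categorical column into 0/1 indicator columns
    X_encoded = pd.get_dummies(X, columns=['region'])
    print(X_encoded.columns.tolist())
    # ['is_urban', 'region_A', 'region_B', 'region_C']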
