Python: 3.6
Windows: 10
I have a few questions regarding Random Forest and the problem at hand:
I am using a randomized search to tune a Random Forest on a regression problem. I want to plot the tree corresponding to the best-fit parameters that the search found. Here is the code:
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV, train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=55)

# Use the random grid to search for the best hyperparameters.
# First create the base model to tune.
rf = RandomForestRegressor()

# Randomized search over the parameters, using 5-fold cross-validation,
# trying 50 different combinations, and using all available cores.
rf_random = RandomizedSearchCV(estimator=rf, param_distributions=random_grid,
                               n_iter=50, cv=5, verbose=2, random_state=56,
                               n_jobs=-1)

# Fit the randomized search model
rf_random.fit(X_train, y_train)
rf_random.best_params_
The best parameters came out to be:
{'n_estimators': 1000,
'min_samples_split': 5,
'min_samples_leaf': 1,
'max_features': 'auto',
'max_depth': 5,
'bootstrap': True}
How can I plot this tree using the above parameters?
My dependent variable y lies in the range [0, 1] (continuous), and all predictor variables are either binary or categorical. Which algorithms generally work well for this kind of input and output feature space? I tried Random Forest, but it didn't give very good results. Note that y here is a kind of ratio, which is why it lies between 0 and 1. Example: Expense on food / Total Expense.
The data is skewed: the dependent variable y has the value 1 in 60% of the data, and lies somewhere between 0 and 1 (e.g. 0.66, 0.87, and so on) in the rest.
Since my data has only binary {0, 1} and categorical {A, B, C} variables, do I need to convert it into one-hot encoded variables to use a random forest?
Regarding the plot (I am afraid that your other questions are way too broad for SO, where the general idea is to avoid asking multiple questions at the same time):
Fitting your RandomizedSearchCV has resulted in rf_random.best_estimator_, which is itself a random forest with the parameters shown in your question (including 'n_estimators': 1000).
According to the docs, a fitted RandomForestRegressor includes the attribute:
estimators_ : list of DecisionTreeRegressor
The collection of fitted sub-estimators.
So, to plot any individual tree of your Random Forest, you should use either
from sklearn import tree
tree.plot_tree(rf_random.best_estimator_.estimators_[k])
or
from sklearn import tree
tree.export_graphviz(rf_random.best_estimator_.estimators_[k])
for the desired k in [0, 999] in your case ([0, n_estimators - 1] in the general case).
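Putting the pieces together, here is a minimal self-contained sketch (with made-up toy data, since the original X and y are not shown) that fits a small forest and exports one of its individual trees with export_graphviz; plot_tree works the same way on rf.estimators_[k]:

```python
# Minimal sketch on hypothetical toy data: fit a small forest,
# then export one individual tree from estimators_.
from sklearn.ensemble import RandomForestRegressor
from sklearn.tree import export_graphviz

X = [[0, 1], [1, 0], [1, 1], [0, 0]]
y = [0.2, 0.8, 1.0, 0.1]

rf = RandomForestRegressor(n_estimators=10, max_depth=5, random_state=0)
rf.fit(X, y)

# Any k in [0, n_estimators - 1] selects one fitted sub-tree.
dot_source = export_graphviz(rf.estimators_[0], out_file=None,
                             feature_names=["x0", "x1"])
print(dot_source[:13])  # the Graphviz source starts with "digraph Tree {"
```

With your fitted search object, the estimator to index into would be rf_random.best_estimator_.estimators_ instead of rf.estimators_.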
Allow me to take a step back before answering your questions.
Ideally, one should drill down further on the best_params_ output of RandomizedSearchCV through GridSearchCV. RandomizedSearchCV goes over your parameters without trying out all the possible combinations. Then, once you have the best_params_ from RandomizedSearchCV, you can investigate all the possible options across a narrower range.
You did not include the random_grid parameters in your code, but I would expect you to do a GridSearchCV like this:
# Create the parameter grid based on the results of RandomizedSearchCV
param_grid = {
    'max_depth': [4, 5, 6],
    'min_samples_leaf': [1, 2],
    'min_samples_split': [4, 5, 6],
    'n_estimators': [990, 1000, 1010]
}
# Create and fit the grid search model
# (note: GridSearchCV has no random_state parameter)
grid_search = GridSearchCV(estimator=rf, param_grid=param_grid,
                           cv=5, n_jobs=-1, verbose=2)
grid_search.fit(X_train, y_train)
The above goes through all the possible combinations of parameters in param_grid and gives you the best ones.
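A runnable end-to-end sketch of this refinement step (on hypothetical toy data with deliberately small n_estimators values to keep it fast; the real X_train/y_train come from the question's train_test_split):

```python
# Sketch of the GridSearchCV refinement step on toy binary predictors.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

rng = np.random.RandomState(0)
X_train = rng.randint(0, 2, size=(40, 3))  # toy binary features
y_train = rng.rand(40)                      # toy continuous target in [0, 1]

param_grid = {
    'max_depth': [4, 5],
    'min_samples_split': [4, 5],
    'n_estimators': [10, 20],  # small values just for this toy example
}
grid_search = GridSearchCV(estimator=RandomForestRegressor(random_state=56),
                           param_grid=param_grid, cv=3, n_jobs=-1)
grid_search.fit(X_train, y_train)
best = grid_search.best_params_  # one combination from the grid above
```

After fitting, grid_search.best_estimator_ is the refit forest, just as with RandomizedSearchCV.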
Now coming to your questions:
Random forests are a combination of multiple trees, so you do not have just one tree that you can plot. What you can do instead is plot one or more of the individual trees used by the random forest. This can be achieved with the plot_tree function. Have a read of the documentation and this SO question to understand it better.
Did you try a simple linear regression first?
The skew would also affect what kind of accuracy metrics you use to assess your model's fit. Precision, recall, and F1 scores come to mind when dealing with unbalanced/skewed data (though note those apply to classification targets).
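Since y here is continuous, the direct regression analogues are error metrics such as MAE and RMSE. A minimal sketch with made-up predictions, just to show the calls:

```python
# Hypothetical true/predicted values for a ratio target in [0, 1].
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

y_true = np.array([1.0, 0.66, 0.87, 1.0])
y_pred = np.array([0.95, 0.70, 0.80, 1.0])

mae = mean_absolute_error(y_true, y_pred)          # mean absolute error
rmse = np.sqrt(mean_squared_error(y_true, y_pred)) # root mean squared error
print(mae)  # 0.04
```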
Yes, categorical variables need to be converted to dummy (one-hot) variables before fitting a scikit-learn random forest.
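For example, with pandas get_dummies (hypothetical column names; binary columns can stay as they are, only the categorical ones need encoding):

```python
# One-hot encode a categorical column; binary columns pass through unchanged.
import pandas as pd

df = pd.DataFrame({'binary_var': [0, 1, 1],
                   'cat_var': ['A', 'B', 'C']})
encoded = pd.get_dummies(df, columns=['cat_var'])
# 'cat_var' becomes cat_var_A, cat_var_B, cat_var_C; 'binary_var' is untouched.
print(list(encoded.columns))
```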