Why use a random forest to make sure my decision tree model doesn't overfit?

My code:

from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

# Create the decision tree classifier object
clf = DecisionTreeClassifier()

# Train the decision tree classifier
clf = clf.fit(X_train, y_train)

# Use a random forest to make sure my model doesn't overfit
# (n_estimators can be changed as needed)
clf = RandomForestClassifier(n_estimators=20)
clf = clf.fit(ft, pima['brand'])

I would like a clear explanation of how the random forest classifier is being used in the code above. What is the reason for using a random forest classifier here?

Yikes. What is your question actually about? What's the end goal here?

Basically, the random forest algorithm is an ensemble of decision trees. A single decision tree is very sensitive to variations in the data and can easily overfit to noise. A random forest with only one tree will overfit as well, because it behaves much like a single decision tree.
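To make the difference concrete, here is a minimal sketch that compares one fully grown tree with a forest on the same split. The data is synthetic and deliberately noisy (`make_classification` with `flip_y=0.2`), which is an assumption for illustration only; it is not the `pima` data from the question.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, noisy data (assumed for illustration):
# flip_y=0.2 assigns a random class to about 20% of the samples.
X, y = make_classification(n_samples=1000, n_features=20, n_informative=5,
                           flip_y=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

models = {
    "single tree": DecisionTreeClassifier(random_state=0),
    "forest (100 trees)": RandomForestClassifier(n_estimators=100, random_state=0),
}

# Store (train accuracy, test accuracy) for each model.
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = (model.score(X_train, y_train), model.score(X_test, y_test))
    print("{}: train accuracy {:.3f}, test accuracy {:.3f}".format(name, *scores[name]))
```

A fully grown tree memorizes the training set, noise included, so the gap between its train and test accuracy is the overfitting in question. On data like this the forest usually does better on the test split, but the exact numbers depend on the data and the seed.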

As we add trees to the random forest, the tendency to overfit should decrease, thanks to bagging and random feature selection. However, the generalization error will not go to zero. Adding more trees shrinks the variance component of the generalization error (it levels off once the remaining correlation between trees dominates), but it does not reduce the bias.
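The effect of averaging can be sketched with a toy simulation that involves no trees at all. It assumes each predictor's error has the same bias, the same variance `sigma2`, and the same pairwise correlation `rho`; all three values are made up for illustration. Under those assumptions the variance of the average of B predictors is `rho * sigma2 + (1 - rho) * sigma2 / B`.

```python
import numpy as np

rng = np.random.default_rng(0)

sigma2 = 1.0      # variance of each predictor's error (assumed)
rho = 0.3         # pairwise correlation between predictors (assumed)
bias = 0.5        # bias shared by every predictor (assumed)
n_trials = 20000  # number of simulated repetitions


def simulate_average(n_predictors):
    """Return the mean and variance of the averaged error over all trials."""
    # Equicorrelated errors: one component shared by all predictors
    # plus one independent component per predictor.
    shared = rng.normal(size=(n_trials, 1)) * np.sqrt(rho * sigma2)
    own = rng.normal(size=(n_trials, n_predictors)) * np.sqrt((1 - rho) * sigma2)
    avg = (bias + shared + own).mean(axis=1)
    return avg.mean(), avg.var()


results = {}
for b in (1, 5, 20, 100):
    mean_err, var_err = simulate_average(b)
    formula_var = rho * sigma2 + (1 - rho) * sigma2 / b
    results[b] = (mean_err, var_err, formula_var)
    print("B={:>3}  mean error {:.3f}  variance {:.3f}  formula {:.3f}".format(
        b, mean_err, var_err, formula_var))
```

The mean error stays near the assumed bias for every B, while the variance falls toward `rho * sigma2` rather than toward zero. Real trees in a forest are not exactly equicorrelated, so treat this as intuition, not as a model of `RandomForestClassifier`.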

Run the example below:

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

# Import CSV mtcars
data = pd.read_csv('https://gist.githubusercontent.com/ZeccaLehn/4e06d2575eb9589dbe8c365d61cb056c/raw/64f1660f38ef523b2a1a13be77b002b98665cdfe/mtcars.csv')
# Edit element of column header
data.rename(columns={'Unnamed: 0':'brand'}, inplace=True)

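# Features: every column between the model name and the last column.
# The last column (carb) is the target, so it must not also appear in X1.
X1 = data.iloc[:, 1:-1]
Y1 = data.iloc[:, -1]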

# Fit a decision tree classifier and plot its feature importances
from sklearn.tree import DecisionTreeClassifier
tree_clf = DecisionTreeClassifier(criterion='entropy', random_state=1)
tree_clf.fit(X1, Y1)

imp = pd.DataFrame(index=X1.columns, data=tree_clf.feature_importances_, columns=['Imp'])
imp = imp.sort_values(by='Imp', ascending=False)

sns.barplot(x=imp.index.tolist(), y=imp['Imp'].values, palette='coolwarm')
plt.show()

X=data[['cyl','disp','hp','drat','wt','qsec','vs','am','gear','carb']]
y=data['mpg']

# split to train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 0)


rf = RandomForestRegressor(n_estimators=50)
rf.fit(X_train, y_train)
y_train_predicted = rf.predict(X_train)
y_test_predicted_full_trees = rf.predict(X_test)
mse_train = mean_squared_error(y_train, y_train_predicted)
mse_test = mean_squared_error(y_test, y_test_predicted_full_trees)
print("RF with full trees, Train MSE: {} Test MSE: {}".format(mse_train, mse_test))


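# Shallower ("pruned") trees: require at least 5 samples per leaf.
# The training split holds only 22 rows, so min_samples_leaf=25 would allow no split at all.
rf = RandomForestRegressor(n_estimators=50, min_samples_leaf=5)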
rf.fit(X_train, y_train)
y_train_predicted = rf.predict(X_train)
y_test_predicted_pruned_trees = rf.predict(X_test)
mse_train = mean_squared_error(y_train, y_train_predicted)
mse_test = mean_squared_error(y_test, y_test_predicted_pruned_trees)
print("RF with pruned trees, Train MSE: {} Test MSE: {}".format(mse_train, mse_test))


# Refit from scratch with 1, 2, ..., 50 trees and watch the train/test error
for n_trees in range(1, 51):
    rf = RandomForestRegressor(n_estimators=n_trees)
    rf.fit(X_train, y_train)
    y_train_predicted = rf.predict(X_train)
    y_test_predicted = rf.predict(X_test)
    mse_train = mean_squared_error(y_train, y_train_predicted)
    mse_test = mean_squared_error(y_test, y_test_predicted)
    print("Trees: {} Train mse: {} Test mse: {}".format(n_trees, mse_train, mse_test))


from sklearn.tree import DecisionTreeRegressor
from sklearn import tree

# To keep the tree small enough to read, limit it to max_depth=3
regr = DecisionTreeRegressor(max_depth=3, random_state=1234)
regr.fit(X, y)

# 1. Text representation of the fitted tree
text_representation = tree.export_text(regr, feature_names=list(X.columns))
print(text_representation)


# 2. Plot the fitted tree
fig = plt.figure(figsize=(25, 20))
_ = tree.plot_tree(regr, feature_names=list(X.columns), filled=True)
plt.show()
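The loop over the number of trees above retrains the whole forest on every pass. As an alternative sketch, scikit-learn's `warm_start=True` keeps the trees that are already fitted and only adds new ones, and `oob_score=True` reports an out-of-bag R^2 without needing a separate test split. The data here is synthetic (`make_friedman1`), which is an assumption made so the snippet runs without downloading mtcars; the names `X_demo`, `y_demo` and `oob_r2` are illustrative.

```python
from sklearn.datasets import make_friedman1
from sklearn.ensemble import RandomForestRegressor

# Synthetic regression data (assumed for illustration; substitute your own X, y).
X_demo, y_demo = make_friedman1(n_samples=500, noise=1.0, random_state=0)

# warm_start=True reuses the fitted trees and only adds new ones when
# n_estimators is raised. oob_score=True scores each sample using only
# the trees whose bootstrap sample did not contain it.
forest = RandomForestRegressor(n_estimators=25, warm_start=True,
                               oob_score=True, random_state=0)

oob_r2 = {}
for n_trees in range(25, 201, 25):
    forest.set_params(n_estimators=n_trees)
    forest.fit(X_demo, y_demo)
    oob_r2[n_trees] = forest.oob_score_
    print("trees: {:>3}  OOB R^2: {:.3f}".format(n_trees, forest.oob_score_))
```

The out-of-bag score typically improves quickly at first and then flattens, which matches the point above: more trees reduce variance, but they do not remove the remaining error.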
