Why use a random forest to make sure my decision tree model does not overfit?

Question

My code:

from sklearn.tree import DecisionTreeClassifier

# Create the decision tree classifier object
clf = DecisionTreeClassifier()

# Train the decision tree classifier
clf = clf.fit(X_train, y_train)

# Using random forest to make sure my model doesn't overfit

from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier(n_estimators=20)  # n_estimators can be changed as needed
clf = clf.fit(ft,pima['brand'])

I would like the best explanation of how the random forest classifier is applied in the code above. What is the reason for using a random forest classifier here?

Answer 1

Yikes. What is your question actually about? What's the end goal here? Basically, the Random Forest algorithm is an ensemble of decision trees. A single decision tree is very sensitive to variations in the data, so it can easily overfit to noise. A Random Forest with only one tree will overfit as well, because it is essentially the same as a single decision tree.
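To make the point about a single tree fitting noise concrete, here is a minimal sketch comparing one fully grown tree with a forest. The synthetic dataset (with 20% label noise via `flip_y`) and all variable names are assumptions chosen for illustration; they are not part of the asker's code.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with noisy labels (an assumption for this sketch).
X, y = make_classification(n_samples=600, n_features=20, n_informative=5,
                           flip_y=0.2, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# One fully grown tree versus an ensemble of 100 trees.
single_tree = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)

tree_train_acc = single_tree.score(X_tr, y_tr)
tree_test_acc = single_tree.score(X_te, y_te)
forest_train_acc = forest.score(X_tr, y_tr)
forest_test_acc = forest.score(X_te, y_te)

print("Tree   train acc: {:.3f}  test acc: {:.3f}".format(tree_train_acc, tree_test_acc))
print("Forest train acc: {:.3f}  test acc: {:.3f}".format(forest_train_acc, forest_test_acc))
```

The single tree memorizes the training set, noise included, so its train accuracy is perfect while its test accuracy is noticeably lower. On data like this the forest typically narrows that gap, though the exact numbers depend on the dataset and seed.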

As we add trees to the Random Forest, the tendency to overfit should decrease (thanks to bagging and random feature selection). However, the generalization error will not go to zero. The variance component of the generalization error shrinks as more trees are added, but the bias does not!
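The variance-versus-bias claim can be illustrated with a toy simulation that needs no ML library. This is only a sketch under strong assumptions: each "tree" is modelled as the true value plus a fixed shared bias plus independent Gaussian noise. Real trees in a forest are correlated, so the variance reduction there is weaker. `TRUE_VALUE`, `BIAS`, `NOISE_SD` and the function names are hypothetical.

```python
import random
import statistics

TRUE_VALUE = 10.0  # hypothetical quantity being predicted
BIAS = 1.5         # systematic error shared by every toy "tree"
NOISE_SD = 4.0     # standard deviation of each tree's random error


def toy_tree_prediction(rng):
    """One toy tree: truth + shared bias + independent noise."""
    return TRUE_VALUE + BIAS + rng.gauss(0.0, NOISE_SD)


def ensemble_prediction(rng, n_trees):
    """Average the predictions of n_trees toy trees."""
    return statistics.fmean(toy_tree_prediction(rng) for _ in range(n_trees))


def summarize(n_trees, n_repeats=2000, seed=0):
    """Estimate bias and variance of the ensemble over many repeated draws."""
    rng = random.Random(seed)
    preds = [ensemble_prediction(rng, n_trees) for _ in range(n_repeats)]
    bias = statistics.fmean(preds) - TRUE_VALUE
    variance = statistics.pvariance(preds)
    return bias, variance


results = {n: summarize(n) for n in (1, 10, 100)}
for n, (bias, variance) in results.items():
    print("trees: {:3d}  bias: {:.3f}  variance: {:.3f}".format(n, bias, variance))
```

Averaging more toy trees drives the variance down, while the estimated bias stays near `BIAS` no matter how many are averaged. That is why a larger forest helps with overfitting but cannot fix a model that is systematically wrong.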

Run the example below:

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

# Import CSV mtcars
data = pd.read_csv('https://gist.githubusercontent.com/ZeccaLehn/4e06d2575eb9589dbe8c365d61cb056c/raw/64f1660f38ef523b2a1a13be77b002b98665cdfe/mtcars.csv')
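# Rename the 'Unnamed: 0' column header (if present) to 'brand'
data.rename(columns={'Unnamed: 0':'brand'}, inplace=True)

# Features: mpg through gear (columns 1-10). The target, carb, is the last
# column and is kept out of X1 to avoid leaking it into the features.
X1 = data.iloc[:, 1:11]
Y1 = data.iloc[:, -1]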

# Fit a decision tree and plot its feature importances
from sklearn.tree import DecisionTreeClassifier
dt_clf = DecisionTreeClassifier(criterion='entropy', random_state=1)
dt_clf.fit(X1, Y1)

imp = pd.DataFrame(index=X1.columns, data=dt_clf.feature_importances_, columns=['Imp'])
imp = imp.sort_values(by='Imp', ascending=False)  # keep the sorted result

sns.barplot(x=imp.index.tolist(), y=imp['Imp'].to_numpy(), palette='coolwarm')
plt.show()

X=data[['cyl','disp','hp','drat','wt','qsec','vs','am','gear','carb']]
y=data['mpg']

# split to train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 0)


rf = RandomForestRegressor(n_estimators=50)
rf.fit(X_train, y_train)
y_train_predicted = rf.predict(X_train)
y_test_predicted_full_trees = rf.predict(X_test)
mse_train = mean_squared_error(y_train, y_train_predicted)
mse_test = mean_squared_error(y_test, y_test_predicted_full_trees)
print("RF with full trees, Train MSE: {} Test MSE: {}".format(mse_train, mse_test))


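# Shallower trees: require at least 5 samples per leaf. The training set has
# only 22 rows, so a larger value such as 25 would prevent any split.
rf = RandomForestRegressor(n_estimators=50, min_samples_leaf=5)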
rf.fit(X_train, y_train)
y_train_predicted = rf.predict(X_train)
y_test_predicted_pruned_trees = rf.predict(X_test)
mse_train = mean_squared_error(y_train, y_train_predicted)
mse_test = mean_squared_error(y_test, y_test_predicted_pruned_trees)
print("RF with pruned trees, Train MSE: {} Test MSE: {}".format(mse_train, mse_test))


# Grow the forest one tree at a time; warm_start=True keeps the trees already fitted
rf = RandomForestRegressor(n_estimators=1, warm_start=True)
for i in range(50):
    rf.fit(X_train, y_train)
    y_train_predicted = rf.predict(X_train)
    y_test_predicted = rf.predict(X_test)
    mse_train = mean_squared_error(y_train, y_train_predicted)
    mse_test = mean_squared_error(y_test, y_test_predicted)
    print("Iteration: {} Train mse: {} Test mse: {}".format(i, mse_train, mse_test))
    rf.n_estimators += 1


from sklearn.tree import DecisionTreeRegressor
from sklearn import tree

# Fit the regressor; max_depth=3 keeps the tree small enough to read
regr = DecisionTreeRegressor(max_depth=3, random_state=1234)
regr.fit(X, y)

# 1. Text representation of the tree
text_representation = tree.export_text(regr, feature_names=list(X.columns))
print(text_representation)


# 2. Plot the tree
fig = plt.figure(figsize=(25,20))
_ = tree.plot_tree(regr,
                   feature_names=list(X.columns),
                   filled=True)
plt.show()

Answer 1 (accepted, 2021-03-14 02:22:57)
