
Why use a random forest to make sure my decision tree model doesn't overfit?

My code:

# Create a Decision Tree classifier object
clf = DecisionTreeClassifier()

# Train the Decision Tree classifier
clf = clf.fit(X_train, y_train)

# Using random forest to make sure my model doesn't overfit

from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier(n_estimators=20)  # n_estimators can be tuned as needed
clf = clf.fit(ft, pima['brand'])

I'd like a good explanation of how the random forest classifier is applied in the code above. What is the reason for using this random forest classifier here?

Huh? What exactly is your question? Basically, what is the end game here? A random forest is made up of decision trees. A single decision tree is very sensitive to variations in the data and can easily overfit to noise. A random forest with only one tree will also overfit the data, because it behaves the same as a single decision tree.
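As a quick sanity check (this snippet is not part of the original answer, just an illustrative sketch on synthetic data), a one-tree forest with bootstrapping disabled and all features considered at each split should make the same predictions as a plain decision tree:

# Illustrative sketch: a random forest with a single tree, no bootstrap
# sampling, and no feature subsampling is equivalent to one decision tree
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

X_demo, y_demo = make_classification(n_samples=200, random_state=0)

single_tree = DecisionTreeClassifier(random_state=0).fit(X_demo, y_demo)
one_tree_forest = RandomForestClassifier(
    n_estimators=1, bootstrap=False, max_features=None, random_state=0
).fit(X_demo, y_demo)

# Both models should produce identical predictions
print((single_tree.predict(X_demo) == one_tree_forest.predict(X_demo)).all())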

As we add trees to the random forest, the tendency to overfit should decrease (thanks to bagging and random feature selection). However, the generalization error will not go to zero. As more trees are added, the variance of the generalization error approaches zero, but the bias does not!
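You can see the variance part of this claim directly. The following sketch (my addition, on synthetic regression data) refits forests of different sizes under several random seeds; the spread of the test error across seeds should shrink as trees are added, while the average error levels off above zero:

# Sketch: spread of test MSE across random seeds for small vs. large forests
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

X_demo, y_demo = make_regression(n_samples=300, noise=10.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=0)

for n_trees in (1, 10, 100):
    errors = [
        mean_squared_error(
            y_te,
            RandomForestRegressor(n_estimators=n_trees, random_state=seed)
            .fit(X_tr, y_tr)
            .predict(X_te),
        )
        for seed in range(10)
    ]
    print("trees={} mean MSE={:.1f} std across seeds={:.1f}".format(
        n_trees, np.mean(errors), np.std(errors)))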

Run the following example:

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

# Import CSV mtcars
data = pd.read_csv('https://gist.githubusercontent.com/ZeccaLehn/4e06d2575eb9589dbe8c365d61cb056c/raw/64f1660f38ef523b2a1a13be77b002b98665cdfe/mtcars.csv')
# Rename the unnamed first column (the car names) to 'brand'
data.rename(columns={'Unnamed: 0':'brand'}, inplace=True)

# Features: every numeric column except the target; target: the last column (carb)
X1 = data.iloc[:, 1:-1]
Y1 = data.iloc[:, -1]

# Let's fit a decision tree to find the feature importances
from sklearn.tree import DecisionTreeClassifier
dtree = DecisionTreeClassifier(criterion='entropy', random_state=1)
dtree.fit(X1, Y1)

imp = pd.DataFrame(index=X1.columns, data=dtree.feature_importances_, columns=['Imp'])
imp = imp.sort_values(by='Imp', ascending=False)

sns.barplot(x=imp.index.tolist(), y=imp.values.ravel(), palette='coolwarm')
plt.show()

X=data[['cyl','disp','hp','drat','wt','qsec','vs','am','gear','carb']]
y=data['mpg']

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 0)
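
For reference, here is a minimal single-tree baseline on the same split (my addition; it is not in the original answer). A fully grown tree should fit the training set almost perfectly while doing noticeably worse on the test set, which is the overfitting the forests below are meant to reduce:

# Baseline: a single, fully grown regression tree for comparison
from sklearn.tree import DecisionTreeRegressor

single = DecisionTreeRegressor(random_state=0)
single.fit(X_train, y_train)
mse_train = mean_squared_error(y_train, single.predict(X_train))
mse_test = mean_squared_error(y_test, single.predict(X_test))
print("Single tree, Train MSE: {} Test MSE: {}".format(mse_train, mse_test))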


# Random forest with fully grown (unpruned) trees
rf = RandomForestRegressor(n_estimators=50)
rf.fit(X_train, y_train)
y_train_predicted = rf.predict(X_train)
y_test_predicted_full_trees = rf.predict(X_test)
mse_train = mean_squared_error(y_train, y_train_predicted)
mse_test = mean_squared_error(y_test, y_test_predicted_full_trees)
print("RF with full trees, Train MSE: {} Test MSE: {}".format(mse_train, mse_test))


# Random forest whose trees are regularized ("pruned") via min_samples_leaf
rf = RandomForestRegressor(n_estimators=50, min_samples_leaf=25)
rf.fit(X_train, y_train)
y_train_predicted = rf.predict(X_train)
y_test_predicted_pruned_trees = rf.predict(X_test)
mse_train = mean_squared_error(y_train, y_train_predicted)
mse_test = mean_squared_error(y_test, y_test_predicted_pruned_trees)
print("RF with pruned trees, Train MSE: {} Test MSE: {}".format(mse_train, mse_test))


# Grow the forest one tree at a time; warm_start=True keeps the trees
# already fitted and only trains the newly added tree on each call to fit
rf = RandomForestRegressor(n_estimators=1, warm_start=True)
for i in range(50):
    rf.fit(X_train, y_train)
    y_train_predicted = rf.predict(X_train)
    y_test_predicted = rf.predict(X_test)
    mse_train = mean_squared_error(y_train, y_train_predicted)
    mse_test = mean_squared_error(y_test, y_test_predicted)
    print("Iteration: {} Train mse: {} Test mse: {}".format(i, mse_train, mse_test))
    rf.n_estimators += 1
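
This loop lets you watch the effect described above: the train MSE keeps dropping as trees are added, while the test MSE should level off at some nonzero value rather than going to zero.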


import graphviz
from sklearn.tree import DecisionTreeRegressor
from sklearn import tree

# Fit the regressor; max_depth=3 keeps the tree small
regr = DecisionTreeRegressor(max_depth=3, random_state=1234)
model = regr.fit(X, y)

# 1 -- print the tree rules as text
text_representation = tree.export_text(regr)
print(text_representation)


# 2 -- plot the tree with matplotlib
fig = plt.figure(figsize=(25, 20))
_ = tree.plot_tree(regr, feature_names=X.columns, filled=True)
plt.show()
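
The graphviz import above is otherwise unused, so presumably a third rendering option was intended. A minimal sketch using sklearn's export_graphviz (this assumes the Graphviz system binaries are installed, not just the Python package):

# 3 -- render the same tree with graphviz
dot_data = tree.export_graphviz(regr, feature_names=X.columns,
                                filled=True, rounded=True)
graph = graphviz.Source(dot_data)
graph.render("mtcars_tree", format="png", cleanup=True)  # writes mtcars_tree.png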

[image: the plotted decision tree]
