How to test a Random Forest regression model for Overfitting?
I'm using RandomForest for a regression model and wanted to see if my model is overfitting. Here is what I did:
I use GridSearchCV for hyperparameter tuning and then create a RandomForestRegressor with those parameters:
RF = RandomForestRegressor(n_estimators=b['n_estimators'], max_depth=b['max_depth'], min_samples_leaf=b['min_samples_leaf'], random_state=0)
Then I fit the model using the train dataset:
model = RF.fit(x_train, y_train.values.ravel())
Then I predict with the test dataset:
y_pred = model.predict(x_test)
Then I did the exact same with x_train instead of x_test:
y_pred = model.predict(x_train)
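The steps above can be sketched end to end. This is a minimal, self-contained version using synthetic data (`make_regression`) in place of the original dataset, and an illustrative parameter grid; `b` stands for `grid.best_params_`, as in the question:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the real dataset
X, y = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=0)
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# Hyperparameter tuning; the grid values here are illustrative
grid = GridSearchCV(
    RandomForestRegressor(random_state=0),
    param_grid={"n_estimators": [100, 300],
                "max_depth": [5, 10],
                "min_samples_leaf": [1, 5]},
    scoring="neg_mean_absolute_error",
    cv=3,
)
grid.fit(x_train, y_train)
b = grid.best_params_

# Refit with the chosen parameters, as in the question
RF = RandomForestRegressor(n_estimators=b["n_estimators"],
                           max_depth=b["max_depth"],
                           min_samples_leaf=b["min_samples_leaf"],
                           random_state=0)
model = RF.fit(x_train, y_train)

# Evaluate on both sets to compare train vs. test error
mae_test = mean_absolute_error(y_test, model.predict(x_test))
mae_train = mean_absolute_error(y_train, model.predict(x_train))
print(f"train MAE: {mae_train:.2f}, test MAE: {mae_test:.2f}")
```

With a random forest, the train MAE is typically well below the test MAE even on clean data, because each tree fits the training set closely; that alone is not proof of harmful overfitting.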
Here are the results that I achieve:
Test Data:
MAE: 15.11
MAPE: 26.98%
Train Data:
MAE: 6.17
MAPE: 10.97%
As you can see, there is a pretty significant difference. Do I have a big problem with overfitting, or am I doing something wrong when using x_train to predict?
Formulas for the MAE and MAPE:
MAE:
mae = sklearn.metrics.mean_absolute_error(y_test, y_pred)
MAPE:
import numpy as np

def percentage_error(actual, predicted):
    res = np.empty(actual.shape)
    for j in range(actual.shape[0]):
        if actual[j] != 0:
            res[j] = (actual[j] - predicted[j]) / actual[j]
        else:
            res[j] = predicted[j] / np.mean(actual)
    return res

def mean_absolute_percentage_error(y_test, y_pred):
    return np.mean(np.abs(percentage_error(np.asarray(y_test), np.asarray(y_pred)))) * 100
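For comparison, scikit-learn (0.24 and later) ships its own `sklearn.metrics.mean_absolute_percentage_error`. Note two differences from the custom version above: the built-in returns a fraction (0.27, not 27%), and it guards zero actuals with a tiny epsilon rather than the `np.mean(actual)` fallback used here. A quick check with made-up numbers:

```python
import numpy as np
from sklearn.metrics import mean_absolute_percentage_error

# Illustrative values, not from the question's dataset
y_true = np.array([100.0, 50.0, 25.0])
y_hat = np.array([110.0, 45.0, 30.0])

# Per-point errors: 0.10, 0.10, 0.20 -> mean 0.1333
mape_frac = mean_absolute_percentage_error(y_true, y_hat)
print(f"MAPE: {mape_frac * 100:.2f}%")  # → MAPE: 13.33%
```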
Source for the MAPE formula: https://stackoverflow.com/a/59033147/10603410
There is no rule of the form "if this number x is less than y, then we are overfitting"; it is you who needs to decide whether the model is overfitting.
By definition, if the test error is "much bigger than the train error", you are overfitting, but this "much bigger" is not defined; it depends on your data and what the model is used for. If your data is really "easy" (i.e. easy to regress), you would expect the train and test errors to be close. If it is really noisy, you could accept a bigger difference.
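One practical way to judge whether the gap matters is to look at it across several cross-validation folds rather than a single split: if the train/validation gap is stable and the validation error is acceptable for your use case, the model may still be fine. A sketch, again with synthetic data standing in for the real dataset:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_validate

# Synthetic stand-in for the real dataset
X, y = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=0)

# return_train_score=True exposes the per-fold train error alongside the
# validation error, so the gap can be inspected fold by fold
cv = cross_validate(
    RandomForestRegressor(random_state=0),
    X, y, cv=5,
    scoring="neg_mean_absolute_error",
    return_train_score=True,
)
train_mae = -cv["train_score"].mean()
test_mae = -cv["test_score"].mean()
print(f"mean train MAE: {train_mae:.2f}, mean CV MAE: {test_mae:.2f}")
```

If the validation MAE varies wildly between folds, the single train/test comparison in the question may be misleading; if it is consistently around the same value, that value is your realistic error estimate regardless of how low the train error is.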