How to test a Random Forest regression model for Overfitting?
I'm using RandomForest for a regression model and wanted to see if my model is overfitting. Here is what I did:
I use GridSearchCV for hyperparameter tuning and then create a RandomForestRegressor with those parameters:
RF = RandomForestRegressor(n_estimators=b['n_estimators'], max_depth=b['max_depth'], min_samples_leaf=b['min_samples_leaf'], random_state=0)
Then I fit the model using the train dataset:
model = RF.fit(x_train, y_train.values.ravel())
Then I predict with the test dataset:
y_pred = model.predict(x_test)
Then I did the exact same with x_train instead of x_test:
y_pred = model.predict(x_train)
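The steps above can be sketched end to end. This is a minimal, self-contained version using synthetic data (`make_regression`) in place of the original dataset, and an illustrative parameter grid; `b` stands for `grid.best_params_`, as in the question:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the real dataset
X, y = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=0)
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# Hyperparameter tuning; the grid values here are illustrative
grid = GridSearchCV(
    RandomForestRegressor(random_state=0),
    param_grid={"n_estimators": [100, 300],
                "max_depth": [5, 10],
                "min_samples_leaf": [1, 5]},
    scoring="neg_mean_absolute_error",
    cv=3,
)
grid.fit(x_train, y_train)
b = grid.best_params_

# Refit with the chosen parameters, as in the question
RF = RandomForestRegressor(n_estimators=b["n_estimators"],
                           max_depth=b["max_depth"],
                           min_samples_leaf=b["min_samples_leaf"],
                           random_state=0)
model = RF.fit(x_train, y_train)

# Evaluate on both sets to compare train vs. test error
mae_test = mean_absolute_error(y_test, model.predict(x_test))
mae_train = mean_absolute_error(y_train, model.predict(x_train))
print(f"train MAE: {mae_train:.2f}, test MAE: {mae_test:.2f}")
```

With a random forest, the train MAE is typically well below the test MAE even on clean data, because each tree fits the training set closely; that alone is not proof of harmful overfitting.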
Here are the results that I achieve:
Test Data:
MAE: 15.11
MAPE: 26.98%
Train Data:
MAE: 6.17
MAPE: 10.97%
As you can see, there is a pretty significant difference. Do I have a big problem with overfitting, or am I doing something wrong when using x_train to predict?
Formulas for the MAE and MAPE:
MAE:
mae = sklearn.metrics.mean_absolute_error(y_test, y_pred)
MAPE:
import numpy as np

def percentage_error(actual, predicted):
    res = np.empty(actual.shape)
    for j in range(actual.shape[0]):
        if actual[j] != 0:
            res[j] = (actual[j] - predicted[j]) / actual[j]
        else:
            res[j] = predicted[j] / np.mean(actual)
    return res

def mean_absolute_percentage_error(y_test, y_pred):
    return np.mean(np.abs(percentage_error(np.asarray(y_test), np.asarray(y_pred)))) * 100
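For comparison, scikit-learn (0.24 and later) ships its own `sklearn.metrics.mean_absolute_percentage_error`. Note two differences from the custom version above: the built-in returns a fraction (0.27, not 27%), and it guards zero actuals with a tiny epsilon rather than the `np.mean(actual)` fallback used here. A quick check with made-up numbers:

```python
import numpy as np
from sklearn.metrics import mean_absolute_percentage_error

# Illustrative values, not from the question's dataset
y_true = np.array([100.0, 50.0, 25.0])
y_hat = np.array([110.0, 45.0, 30.0])

# Per-point errors: 0.10, 0.10, 0.20 -> mean 0.1333
mape_frac = mean_absolute_percentage_error(y_true, y_hat)
print(f"MAPE: {mape_frac * 100:.2f}%")  # → MAPE: 13.33%
```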
Source for the MAPE formula: https://stackoverflow.com/a/59033147/10603410
There is no rule of the form "if this number x is less than y, then we are overfitting"; it is you who needs to decide whether the model is overfitting.
By definition, if the test error is "much bigger than the train error", you are overfitting, but this "much bigger" is not defined; it depends on your data and what the model is used for. If your data is really "easy" (i.e. easy to regress), you would expect the train and test errors to be close. If it is really noisy, you could accept a bigger difference.
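One practical way to judge whether the gap matters is to look at it across several cross-validation folds rather than a single split: if the train/validation gap is stable and the validation error is acceptable for your use case, the model may still be fine. A sketch, again with synthetic data standing in for the real dataset:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_validate

# Synthetic stand-in for the real dataset
X, y = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=0)

# return_train_score=True exposes the per-fold train error alongside the
# validation error, so the gap can be inspected fold by fold
cv = cross_validate(
    RandomForestRegressor(random_state=0),
    X, y, cv=5,
    scoring="neg_mean_absolute_error",
    return_train_score=True,
)
train_mae = -cv["train_score"].mean()
test_mae = -cv["test_score"].mean()
print(f"mean train MAE: {train_mae:.2f}, mean CV MAE: {test_mae:.2f}")
```

If the validation MAE varies wildly between folds, the single train/test comparison in the question may be misleading; if it is consistently around the same value, that value is your realistic error estimate regardless of how low the train error is.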