Why does my XGBoost model have good accuracy on the training and test datasets but perform poorly on a held-out dataset?

Question

I'm currently working on an XGBoost regression model to predict ticket bookings. My issue is that the model has good accuracy on the training set (around 96%) and on the test set (around 94%), but when I use it to predict bookings on another held-out dataset, the accuracy drops to 82%. I tried moving some data from my test set to this held-out set and the accuracy on those rows is still pretty bad, even though the model predicts them well when they are inside my test set. I assume I'm doing something wrong but I can't figure out what. Any help would be appreciated, thanks.

Here's the XGBoost model part of my code:

import xgboost as xgb
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Features are every column except the last; the target is the last column.
X_conso, y_conso = data_conso2.iloc[:, :-1], data_conso2.iloc[:, -1]

X_train, X_test, y_train, y_test = train_test_split(X_conso, y_conso, test_size=0.3, random_state=20)

d_train = xgb.DMatrix(X_train, label = y_train)
d_test = xgb.DMatrix(X_test, label = y_test)
d_fcst_held_out = xgb.DMatrix(X_fcst_held_out)


params = {'p_colsample_bytree_conso' : 0.9, 
          'p_colsample_bylevel_conso': 0.9,
          'p_colsample_bynode_conso': 0.9,
          'p_learning_rate_conso': 0.3,
          'p_max_depth_conso': 10,
          'p_alpha_conso': 3,
          'p_n_estimators_conso': 10,
          'p_gamma_conso': 0.8}

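steps = 100

# Pass the round count and the evaluation sets by keyword; recent XGBoost
# releases no longer accept them positionally without a warning.
watchlist = [(d_train, 'train'), (d_test, 'test')]
model = xgb.train(params, d_train, num_boost_round=steps, evals=watchlist, early_stopping_rounds=50)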

preds_train = model.predict(d_train)
preds_test = model.predict(d_test)
preds_fcst = model.predict(d_fcst_held_out)

And my error levels:

Error train: 4.524787%
Error test: 5.978759%
Error fcst: 18.008451%

Answer 1

This is generally normal: a model usually has lower accuracy on unseen data than on the data it was trained and validated on.
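That said, one detail in the posted code may be worth checking before anything else. As far as I know, XGBoost does not recognize keys such as `p_max_depth_conso`; its own names are `colsample_bytree`, `colsample_bylevel`, `colsample_bynode`, `learning_rate`, `max_depth`, `alpha` and `gamma`, and `n_estimators` belongs to the scikit-learn wrapper (for `xgb.train` the number of rounds is the `num_boost_round` argument). If that is right, the model in the question is effectively trained with default settings. Below is a minimal sketch of renaming the keys; the helper `to_xgb_params` and the assumption that `p_` and `_conso` are just a naming convention are mine, not something stated in the question.

```python
# Keys exactly as posted in the question (custom names, not XGBoost's own).
posted = {
    'p_colsample_bytree_conso': 0.9,
    'p_colsample_bylevel_conso': 0.9,
    'p_colsample_bynode_conso': 0.9,
    'p_learning_rate_conso': 0.3,
    'p_max_depth_conso': 10,
    'p_alpha_conso': 3,
    'p_n_estimators_conso': 10,
    'p_gamma_conso': 0.8,
}


def to_xgb_params(custom, prefix='p_', suffix='_conso'):
    """Hypothetical helper: strip the prefix/suffix so keys match XGBoost's names."""
    cleaned = {}
    for key, value in custom.items():
        name = key
        if name.startswith(prefix):
            name = name[len(prefix):]
        if name.endswith(suffix):
            name = name[:-len(suffix)]
        cleaned[name] = value
    # n_estimators is a scikit-learn wrapper setting; with xgb.train the
    # round count is passed separately as num_boost_round.
    num_boost_round = cleaned.pop('n_estimators', None)
    return cleaned, num_boost_round


params, num_boost_round = to_xgb_params(posted)
print(params)
print(num_boost_round)
```

The resulting `params` could then be passed to `xgb.train(params, d_train, num_boost_round=...)`. Note that the question sets both 10 estimators and `steps = 100`, so which round count was intended is not clear from the post.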

To improve accuracy on unseen data, you can tune your hyperparameters, for example with Optuna.

Answer 1
0 2022-01-07 07:05:41