Why does my XGBoost model have good accuracy on the training and testing datasets, but poor accuracy when predicting a held-out dataset?

I'm currently working on an XGBoost regression model to predict ticket bookings. My issue is that my model has good accuracy on the training set (around 96%) and on the testing set (around 94%), but when I use the model to predict bookings on another held-out dataset, the accuracy drops to 82%. I tried moving some data from my testing set into this held-out set and the accuracy is still pretty bad, even though the model predicts these same rows well when they are inside my testing set. I assume I'm doing something wrong but I can't figure out what. Any help would be appreciated, thanks.

Here's the XGBoost model part of my code:

import xgboost as xgb
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Features are every column but the last; the target is the last column.
# data_conso2 and X_fcst_held_out are defined elsewhere in the script.
X_conso, y_conso = data_conso2.iloc[:, :-1], data_conso2.iloc[:, -1]

X_train, X_test, y_train, y_test = train_test_split(
    X_conso, y_conso, test_size=0.3, random_state=20)

d_train = xgb.DMatrix(X_train, label=y_train)
d_test = xgb.DMatrix(X_test, label=y_test)
d_fcst_held_out = xgb.DMatrix(X_fcst_held_out)

# The original keys ('p_colsample_bytree_conso', ...) are not XGBoost
# parameter names, so XGBoost would not apply them. The standard names
# are used here. 'n_estimators' is dropped because xgb.train takes the
# number of boosting rounds as its own argument (steps below).
params = {'colsample_bytree': 0.9,
          'colsample_bylevel': 0.9,
          'colsample_bynode': 0.9,
          'learning_rate': 0.3,
          'max_depth': 10,
          'alpha': 3,
          'gamma': 0.8}

steps = 100

# Early stopping monitors the last entry of the watchlist (the test set).
watchlist = [(d_train, 'train'), (d_test, 'test')]
model = xgb.train(params, d_train, steps, watchlist, early_stopping_rounds=50)

preds_train = model.predict(d_train)
preds_test = model.predict(d_test)
preds_fcst = model.predict(d_fcst_held_out)

And my error levels:

Error train: 4.524787%
Error test: 5.978759%
Error fcst: 18.008451%
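
The post does not show how these percentages were computed. One common choice for a regression target is the mean absolute percentage error (MAPE). The sketch below assumes that metric. The function name `mape` is hypothetical and not taken from the original code. Applying one and the same function to all three sets rules out a difference in how the numbers were calculated.

```python
def mape(y_true, y_pred):
    """Mean absolute percentage error, returned in percent.

    Assumption: this is the kind of metric behind the reported errors;
    the original post does not say which formula was used.
    """
    y_true = list(y_true)
    y_pred = list(y_pred)
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("cannot compute MAPE on empty input")
    if any(t == 0 for t in y_true):
        # MAPE is undefined when a true value is zero.
        raise ValueError("MAPE is undefined for true values equal to 0")
    ratios = [abs(t - p) / abs(t) for t, p in zip(y_true, y_pred)]
    return 100.0 * sum(ratios) / len(ratios)


# Toy numbers for illustration only, not the booking data.
example_error = mape([100, 200], [110, 180])
```

With the variables from the question, the calls would look like `mape(y_train, preds_train)` and `mape(y_test, preds_test)`. The held-out set needs its own true values, which the posted code does not show.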

This is generally normal: unseen data usually has lower accuracy.
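
A gap as large as the one in the question can also appear when the held-out data is distributed differently from the training data, for example because it covers another period or was preprocessed differently. The following is a rough sketch of such a check, not a diagnosis of the original problem. The function `drifted_features`, its threshold, and the toy column names are all hypothetical.

```python
from statistics import mean, stdev


def drifted_features(train_cols, held_out_cols, threshold=2.0):
    """Return {feature: score} for features that look shifted.

    Both arguments map a feature name to a sequence of numbers. The score
    is the distance between the held-out mean and the training mean,
    measured in training standard deviations. Features missing from the
    held-out data, or constant in training but different in the held-out
    data, get a score of infinity.
    """
    flagged = {}
    for name, train_values in train_cols.items():
        if name not in held_out_cols:
            flagged[name] = float("inf")
            continue
        shift = abs(mean(held_out_cols[name]) - mean(train_values))
        sigma = stdev(train_values)  # needs at least two training values
        if sigma == 0:
            if shift > 0:
                flagged[name] = float("inf")
            continue
        score = shift / sigma
        if score > threshold:
            flagged[name] = score
    return flagged


# Toy data for illustration only (hypothetical column names).
train = {"price": [10, 12, 11, 13, 9], "day_of_week": [0, 1, 2, 3, 4]}
held_out = {"price": [30, 32, 31], "day_of_week": [1, 2, 3]}
flagged = drifted_features(train, held_out)
```

With pandas DataFrames such as `X_train` and `X_fcst_held_out`, the inputs could be built with `X_train.to_dict("list")`. Comparing means is a coarse heuristic and only works for numeric columns, so a clean result does not prove the two sets are alike.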

To improve accuracy on unseen data, you can tune your hyperparameters, for example with optuna.
