
XGBoost training time seems to be too long

I am trying to train an XGBoost classifier in Python using the xgboost package. I am using the default values for all of the classifier's parameters, and my training set has around 16,000 samples with 180,000 features each. I am not using a GPU to train the model, but even so, the training process has already taken more than five hours and is still running. I have 32 GB of RAM and a 6-core Intel i7. Is this a normal training time for this amount of data? I have heard of people training the model in a couple of minutes.

If training time is a concern, you can switch the tree-growing policy tree_method to hist, which is a histogram-based method. With a GPU it should be set to gpu_hist. You can find more details about its xgboost implementation here: http://arxiv.org/abs/1603.02754

This is the secret sauce that enables very fast training without much compromise in solution quality. In fact, GPU-based training and even LightGBM rely on histogram-based techniques for faster training and hence faster iterations/experiments, which matters a lot in time-constrained Kaggle-style competitions. hist may cut training time to half or less, and gpu_hist on a GPU may take it down to minutes.

PS: I would suggest reducing the dimensionality of your data (16k x 180k) by removing correlated/rank-correlated features, which will further improve not only your training time but also model performance.
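One common way to do this is to drop one feature from each highly correlated pair. The sketch below uses a hypothetical 0.95 threshold on a tiny synthetic frame; note that at 180k features a full correlation matrix will not fit in memory, so in practice you would apply this to feature blocks or after a cheaper filtering step:

```python
import numpy as np
import pandas as pd

# Toy frame: five independent features plus one near-duplicate of "a".
rng = np.random.default_rng(1)
df = pd.DataFrame(rng.normal(size=(100, 5)), columns=list("abcde"))
df["f"] = df["a"] * 0.99 + rng.normal(scale=0.01, size=100)

# Absolute pairwise correlations; keep only the upper triangle so each
# pair is inspected once and a feature is never compared with itself.
corr = df.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# Drop any column whose correlation with an earlier column exceeds 0.95.
to_drop = [col for col in upper.columns if (upper[col] > 0.95).any()]
reduced = df.drop(columns=to_drop)
print("dropped:", to_drop)
```

For rank correlation, `df.corr(method="spearman")` can be substituted for the default Pearson correlation without changing the rest of the logic.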
