
XGBoost and scikit-optimize: BayesSearchCV and XGBRegressor are incompatible - why?

I have a very large dataset (7 million rows, 54 features) that I would like to fit a regression model to using XGBoost. To train the best possible model, I want to use BayesSearchCV from scikit-optimize to run the fit repeatedly for different hyperparameter combinations until the best-performing set is found.

For a given set of hyperparameters, XGBoost takes a very long time to train a model, so in order to find the best hyperparameters without spending days on every permutation of training folds, hyperparameters, etc., I want to multithread both XGBoost and BayesSearchCV. The relevant part of my code looks like this:

from sklearn.model_selection import KFold
from sklearn.pipeline import Pipeline
from skopt import BayesSearchCV
from xgboost import XGBRegressor

xgb_pipe = Pipeline([('clf', XGBRegressor(random_state=42, objective='reg:squarederror', n_jobs=1))])

xgb_fit_params = {'clf__early_stopping_rounds': 5, 'clf__eval_metric': 'mae', 'clf__eval_set': [[X_val.values, y_val.values]]}

# shuffle=True is required for random_state to have any effect in KFold
xgb_kfold = KFold(n_splits=5, shuffle=True, random_state=42)

# xgb_params is the hyperparameter search space, defined elsewhere
xgb_unsm_cv = BayesSearchCV(xgb_pipe, xgb_params, cv=xgb_kfold, n_jobs=2, n_points=1, n_iter=15, random_state=42, verbose=4, scoring='neg_mean_absolute_error', fit_params=xgb_fit_params)

xgb_unsm_cv.fit(X_train.values, y_train.values)

However, I've found that when n_jobs > 1 in the BayesSearchCV call, the fit crashes and I get the following error:

TerminatedWorkerError: A worker process managed by the executor was unexpectedly terminated. This could be caused by a segmentation fault while calling the function or by an excessive memory usage causing the Operating System to kill the worker.

The exit codes of the workers are {SIGKILL(-9)}

This error occurs whenever I use more than one thread in the BayesSearchCV call, regardless of how much memory I provide.

Is this some fundamental incompatibility between XGBoost and scikit-optimize, or can the two packages somehow be made to work together? Without some way of multithreading the optimization, I fear that fitting my model will take weeks. What can I do to fix this?

I don't think the error has anything to do with incompatibility between the libraries. Rather, since you are asking for two different multi-threaded operations, you are running out of memory: your program is trying to put the complete dataset into RAM not once but twice, once per instance (depending on the number of threads).

TerminatedWorkerError: A worker process managed by the executor was unexpectedly terminated. This could be caused by a segmentation fault while calling the function or by an excessive memory usage causing the Operating System to kill the worker.

The exit codes of the workers are {SIGKILL(-9)}

The SIGKILL(-9) exit code means the operating system killed the worker process, which typically happens when the system runs out of available memory.
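As a rough sanity check of the memory claim (my arithmetic, not a measurement from the original post): a single dense float64 copy of a 7-million-row, 54-feature matrix is already close to 3 GB, before counting XGBoost's internal DMatrix buffers or the per-worker copies that process-based parallelism makes.

rows, cols, float64_bytes = 7_000_000, 54, 8
one_copy_gb = rows * cols * float64_bytes / 1024**3  # ~2.8 GB per dense copy
# With n_jobs = 2 in BayesSearchCV, each worker process can hold its own copy,
# and XGBoost's DMatrix conversion adds more on top of that.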

Note that XGBoost is a RAM-hungry beast; coupling it with another multi-threaded operation is bound to take a toll (and personally, I don't recommend it on a daily-driver machine).

The most viable solution would probably be to use Google's TPUs or some other cloud service (beware of the costs), or to reduce the size of the dataset for processing using statistical techniques like the ones mentioned in this kaggle notebook and this Data Science StackExchange article.
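For the downsizing route, here is a minimal sketch of the dtype-downcasting trick that such memory-reduction guides describe (assuming X_train is a pandas DataFrame, as the question's use of X_train.values suggests; float32 precision is usually sufficient for XGBoost):

import numpy as np

def downcast_floats(df):
    # Halve the footprint of float columns: float64 -> float32.
    for col in df.select_dtypes(include=['float64']).columns:
        df[col] = df[col].astype(np.float32)
    return df

X_train = downcast_floats(X_train)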

The idea is: either you upscale the hardware (monetary cost), go head-on with single-threaded BayesianCV (time cost), or downsize the data using whatever technique suits you best. A sketch of the single-threaded route follows.
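One way to take the single-thread route without giving up all parallelism (a sketch based on the question's own code, not a guaranteed fix): parallelise inside XGBoost, where threads share one copy of the data, rather than across BayesSearchCV workers, where each process duplicates it.

# Threads inside XGBoost share the data; BayesSearchCV stays single-process.
xgb_pipe = Pipeline([('clf', XGBRegressor(random_state=42, objective='reg:squarederror', n_jobs=-1))])

xgb_unsm_cv = BayesSearchCV(xgb_pipe, xgb_params, cv=xgb_kfold, n_jobs=1, n_points=1, n_iter=15, random_state=42, verbose=4, scoring='neg_mean_absolute_error', fit_params=xgb_fit_params)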

Finally, the answer is still that the libraries are probably compatible; the data is just too large for the available RAM.
