Memory allocation error: Call to XGBoost C function XGBoosterUpdateOneIter failed: std::bad_alloc
Working with a Julia notebook on SageMaker: ml.m5d.24xlarge with 500 GB of memory.
I'm training an XGBoost model with 230 features (input files are ~500 MB each on average). It trains without issue up to 205 files, but after that I randomly get this error:
┌ Info: Starting XGBoost training
└ num_boost_rounds = 99
ERROR: LoadError: Call to XGBoost C function XGBoosterUpdateOneIter failed: std::bad_alloc
Stacktrace:
[1] error(::String, ::String, ::String, ::String)
@ Base ./error.jl:42
[2] XGBoosterUpdateOneIter(handle::Ptr{Nothing}, iter::Int32, dtrain::Ptr{Nothing})
@ XGBoost ~/.julia/packages/XGBoost/fI0vs/src/xgboost_wrapper_h.jl:11
[3] #update#21
@ ~/.julia/packages/XGBoost/fI0vs/src/xgboost_lib.jl:204 [inlined]
[4] xgboost(data::XGBoost.DMatrix, nrounds::Int64; label::Type, param::Vector{Any}, watchlist::Vector{Any}, metrics::Vector{String}, obj::Type, feval::Type, group::Vector{Any}, kwargs::Base.Iterators.Pairs{Symbol, Any, NTuple{15, Symbol}, NamedTuple{(:objective, :num_class, :num_parallel_tree, :eta, :gamma, :max_depth, :min_child_weight, :max_delta_step, :subsample, :colsample_bytree, :lambda, :alpha, :tree_method, :grow_policy, :max_leaves), Tuple{String, Int64, Int64, Float64, Float64, Int64, Int64, Int64, Float64, Float64, Int64, Int64, String, String, Int64}}})
@ XGBoost ~/.julia/packages/XGBoost/fI0vs/src/xgboost_lib.jl:185
[5] macro expansion
@ /home/src/Training.jl:175 [inlined]
[6] macro expansion
@ ./timing.jl:210 [inlined]
Not sure how to fix it. The AWS instance already has the maximum CPU memory available, and I'm already using 99 procs/workers.
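For reference, the training call looks roughly like this (a simplified sketch using the old XGBoost.jl API from the stack trace; the hyperparameter values here are illustrative, not my exact settings):

```julia
using XGBoost

# X is a 230-column feature matrix loaded from one ~500 MB file; y are the labels.
# All hyperparameter values below are placeholders, not my actual configuration.
bst = xgboost(X, 99,                    # 99 boosting rounds, as in the log above
              label = y,
              objective = "multi:softmax",
              num_class = 3,
              tree_method = "hist",
              max_depth = 8,
              subsample = 0.8)
```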
It looks like you're trying to allocate more memory than is available on the machine.
Unfortunately there's not much to do here other than sub-sampling your dataset or trying a larger instance.
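For example, a rough sketch of row sub-sampling in Julia before training (here `X`, `y`, and the 50% fraction are placeholders):

```julia
using Random, XGBoost

# Train on a random 50% of the rows to cut peak memory (fraction is illustrative).
n = size(X, 1)
keep = randperm(n)[1:n ÷ 2]
bst = xgboost(X[keep, :], 99, label = y[keep], tree_method = "hist")
```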
An alternative is to try distributed training, using something like Dask: https://xgboost.readthedocs.io/en/stable/tutorials/dask.html