Memory allocation error: Call to XGBoost C function XGBoosterUpdateOneIter failed: std::bad_alloc
Working with a Julia notebook on SageMaker: ml.m5d.24xlarge with 500 GB of memory.
I'm training an XGBoost model with 230 features (input files are ~500 MB each on average). It trains without issue up to 205 files, but after that I randomly get this error:
┌ Info: Starting XGBoost training
└ num_boost_rounds = 99
ERROR: LoadError: Call to XGBoost C function XGBoosterUpdateOneIter failed: std::bad_alloc
Stacktrace:
[1] error(::String, ::String, ::String, ::String)
@ Base ./error.jl:42
[2] XGBoosterUpdateOneIter(handle::Ptr{Nothing}, iter::Int32, dtrain::Ptr{Nothing})
@ XGBoost ~/.julia/packages/XGBoost/fI0vs/src/xgboost_wrapper_h.jl:11
[3] #update#21
@ ~/.julia/packages/XGBoost/fI0vs/src/xgboost_lib.jl:204 [inlined]
[4] xgboost(data::XGBoost.DMatrix, nrounds::Int64; label::Type, param::Vector{Any}, watchlist::Vector{Any}, metrics::Vector{String}, obj::Type, feval::Type, group::Vector{Any}, kwargs::Base.Iterators.Pairs{Symbol, Any, NTuple{15, Symbol}, NamedTuple{(:objective, :num_class, :num_parallel_tree, :eta, :gamma, :max_depth, :min_child_weight, :max_delta_step, :subsample, :colsample_bytree, :lambda, :alpha, :tree_method, :grow_policy, :max_leaves), Tuple{String, Int64, Int64, Float64, Float64, Int64, Int64, Int64, Float64, Float64, Int64, Int64, String, String, Int64}}})
@ XGBoost ~/.julia/packages/XGBoost/fI0vs/src/xgboost_lib.jl:185
[5] macro expansion
@ /home/src/Training.jl:175 [inlined]
[6] macro expansion
@ ./timing.jl:210 [inlined]
Not sure how to fix it. The AWS instance already has the maximum CPU memory available, and I'm already using 99 procs/workers.
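For reference, the training call looks roughly like this (a simplified sketch using the old XGBoost.jl API from the stack trace; the hyperparameter values here are illustrative, not my exact settings):

```julia
using XGBoost

# X is a 230-column feature matrix loaded from one ~500 MB file; y are the labels.
# All hyperparameter values below are placeholders, not my actual configuration.
bst = xgboost(X, 99,                    # 99 boosting rounds, as in the log above
              label = y,
              objective = "multi:softmax",
              num_class = 3,
              tree_method = "hist",
              max_depth = 8,
              subsample = 0.8)
```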
It looks like you're trying to allocate more memory than is available on the machine.
Unfortunately there's not much to do here other than sub-sampling your dataset or trying a larger instance.
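For example, a rough sketch of row sub-sampling in Julia before training (here `X`, `y`, and the 50% fraction are placeholders):

```julia
using Random, XGBoost

# Train on a random 50% of the rows to cut peak memory (fraction is illustrative).
n = size(X, 1)
keep = randperm(n)[1:n ÷ 2]
bst = xgboost(X[keep, :], 99, label = y[keep], tree_method = "hist")
```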
An alternative is to try distributed training, using something like Dask: https://xgboost.readthedocs.io/en/stable/tutorials/dask.html