
Using ray + LightGBM + limited memory

So, I would like to train a LightGBM model on a remote, large Ray cluster and a large dataset. Before that, I would like to write the code such that I can also run the training in a memory-constrained setting, e.g. my local laptop, where the dataset does not fit in memory. That will require some way of lazy-loading the data.

The way I imagine it, it should be possible with Ray to load batches of random samples of the large dataset from disk (multiple .pq files) and feed them to the LightGBM training function. Memory would thereby act as a fast buffer: it holds randomly loaded batches, which are fed to the training function and then removed from memory. Multiple workers handle the training plus the I/O ops for loading new samples from disk into memory. The maximum amount of memory can be capped so that it does not exceed my local resources and my PC doesn't crash. Is this possible?
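For concreteness, here is a rough, untested sketch of the buffering pattern I mean, using Ray Datasets (ray.data.read_parquet / iter_batches); the batch size and the incremental-update step are placeholders:

import ray

# Stream shuffled batches of the parquet files through memory
# instead of materializing the full dataset at once.
ds = ray.data.read_parquet(filenames)

for batch in ds.random_shuffle().iter_batches(
    batch_size=100_000,      # placeholder; sized to fit local RAM
    batch_format="pandas",
):
    X = batch.drop(columns=[TARGET])
    y = batch[TARGET]
    # ... feed (X, y) to an incremental training step here; once the loop
    # moves on, the batch can be garbage-collected, keeping memory bounded ...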

I do not yet understand whether LightGBM needs the full dataset at once, or whether it can be fed batches iteratively, as with neural networks, for instance. So far, I have tried using the lightgbm_ray library for this:

from lightgbm_ray import RayDMatrix, RayParams, train, RayFileType

# some stuff before 
... 

# make dataset
data_train = RayDMatrix(
    data=filenames,
    label=TARGET,
    feature_names=features,
    filetype=RayFileType.PARQUET,
    num_actors=2,
    lazy=True,
)

# feed to training function
evals_result = {}
bst = train(
    params_model,
    data_train,
    evals_result=evals_result,
    valid_sets=[data_train],
    valid_names=["train"],
    verbose_eval=False,
    ray_params=RayParams(num_actors=2, cpus_per_actor=2)
)

I thought the lazy=True keyword might take care of it; however, when executing this, I see memory being maxed out and then my app crashes.

Thanks for any advice!

LightGBM requires loading the entire dataset for training, so in this case, you can test on your laptop with a subset of the data (i.e. only pass a subset of the parquet filenames in).
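For example, to smoke-test locally, you might pass only a slice of the file list; the slice size here is arbitrary:

# Same setup as above, but with only a handful of the parquet files,
# so that the subset fits in local memory.
data_train = RayDMatrix(
    data=filenames[:4],     # arbitrary subset of the .pq files
    label=TARGET,
    feature_names=features,
    filetype=RayFileType.PARQUET,
)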

The lazy=True flag delays the data loading so that it is split across the actors, rather than being loaded into memory first and then split and sent to the actors. However, this still loads the entire dataset into memory, since all actors are on the same (local) node.
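To illustrate, lazy=True only pays off once the actors can be scheduled on different nodes, e.g. after connecting to an existing multi-node cluster. A minimal sketch, with a placeholder cluster address:

import ray
from lightgbm_ray import RayParams

# Connect to an existing multi-node cluster (address is a placeholder);
# with lazy=True, each actor then loads only its own shard of the files.
ray.init(address="ray://head-node:10001")
ray_params = RayParams(num_actors=8, cpus_per_actor=2)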

Additionally, when you do move to running on the remote cluster, these tips might be helpful for optimizing memory usage: https://docs.ray.io/en/latest/train/gbdt.html?highlight=xgboost%20memro#how-to-optimize-xgboost-memory-usage
