
Managing worker memory on a dask localcluster

I am trying to load a dataset with dask but when it is time to compute my dataset I keep getting problems like this:

WARNING - Worker exceeded 95% memory budget. Restarting.

I am just working on my local machine, initiating dask as follows:

import libmarket.config
from dask.distributed import Client

if __name__ == '__main__':
    libmarket.config.client = Client()  # use dask.distributed by default

Now in my error messages I keep seeing a reference to a 'memory_limit=' keyword parameter. However, I've searched the dask documentation thoroughly and I can't figure out how to increase the bloody worker memory limit in a single-machine configuration. I have 256GB of RAM, and I'm removing the majority of the columns of the future (a 20GB csv file) before converting it back into a pandas dataframe, so I know it will fit in memory. I just need to increase the per-worker memory limit from my code (not using dask-worker) so that I can process it.

Please, somebody help me.

The argument memory_limit can be provided to the __init__() functions of Client and LocalCluster.

general remarks

Just calling Client() is a shortcut for first calling LocalCluster() and then calling Client with the created cluster (Dask: Single Machine). When Client is called without an instance of LocalCluster, all possible arguments of LocalCluster.__init__() can be provided to the initialization call of Client. Therefore, the argument memory_limit (and other arguments such as n_workers) is not documented in the API documentation of the Client class.

However, the argument memory_limit does not seem to be properly documented in the API documentation of LocalCluster (see Dask GitHub Issue #4118).

solution

A working example would be the following. I added some more arguments, which might be useful for people finding this question/answer.

# load/import classes
from dask.distributed import Client, LocalCluster

# set up cluster and workers
cluster = LocalCluster(n_workers=4, 
                       threads_per_worker=1,
                       memory_limit='64GB')
client = Client(cluster)

# have a look at your workers
client

# do some work
## ... 

# close workers and cluster
client.close()
cluster.close()
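
To confirm that the limit was actually applied, you can ask the scheduler what it knows about its workers before closing the client. This is a minimal sketch, assuming the client from the example above is still open; the exact fields returned by scheduler_info() may differ between distributed versions, so treat the field name as an assumption.

# inspect the per-worker memory limit the scheduler reports (in bytes)
for address, info in client.scheduler_info()['workers'].items():
    print(address, info.get('memory_limit'))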

The shortcut would be

# load/import classes
from dask.distributed import Client

# set up cluster and workers
client = Client(n_workers=4, 
                threads_per_worker=1,
                memory_limit='64GB')

# have a look at your workers
client

# do some work
## ... 

# close workers and cluster
client.close()
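
Applied to the original problem (a large csv of which only a few columns are needed), the whole flow might look like the sketch below. The file path and column names are placeholders, not taken from the question.

# load/import classes
import dask.dataframe as dd
from dask.distributed import Client

# raise the per-worker memory limit before doing any work
client = Client(n_workers=4, threads_per_worker=1, memory_limit='64GB')

# lazily read the large csv, keep only the required columns,
# then materialise the (much smaller) result as a pandas dataframe
ddf = dd.read_csv('data.csv')        # placeholder path
needed = ddf[['col_a', 'col_b']]     # placeholder column names
df = needed.compute()

# close workers and cluster
client.close()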

further reading
