Managing worker memory on a dask LocalCluster
I am trying to load a dataset with dask but when it is time to compute my dataset I keep getting problems like this:
WARNING - Worker exceeded 95% memory budget. Restarting.
I am just working on my local machine, initiating dask as follows:
from dask.distributed import Client

if __name__ == '__main__':
    libmarket.config.client = Client()  # use dask.distributed by default
Now in my error messages I keep seeing a reference to a 'memory_limit=' keyword parameter. However, I've searched the dask documentation thoroughly and I can't figure out how to increase the bloody worker memory limit in a single-machine configuration. I have 256GB of RAM and I'm removing the majority of the dataframe's columns (a 20GB csv file) before converting it back into a pandas dataframe, so I know it will fit in memory. I just need to increase the per-worker memory limit from my code (not using dask-worker) so that I can process it.

Please, somebody help me.
The argument memory_limit can be provided to the __init__() functions of Client and LocalCluster.
Just calling Client() is a shortcut for first calling LocalCluster() and then calling Client with the created cluster ( Dask: Single Machine ). When Client is called without an instance of LocalCluster, all possible arguments of LocalCluster.__init__() can be provided to the initialization call of Client. Therefore, the argument memory_limit (and other arguments such as n_workers) are not documented in the API documentation of the Client class.
However, the argument memory_limit does not seem to be properly documented in the API documentation of LocalCluster (see Dask GitHub Issue #4118 ).
A working example would be the following. I added some more arguments, which might be useful for people finding this question/answer.
# load/import classes
from dask.distributed import Client, LocalCluster

# set up cluster and workers
cluster = LocalCluster(n_workers=4,
                       threads_per_worker=1,
                       memory_limit='64GB')
client = Client(cluster)

# have a look at your workers
client

# do some work
## ...

# close workers and cluster
client.close()
cluster.close()
The shortcut would be
# load/import classes
from dask.distributed import Client

# set up cluster and workers
client = Client(n_workers=4,
                threads_per_worker=1,
                memory_limit='64GB')

# have a look at your workers
client

# do some work
## ...

# close workers and cluster
client.close()