Why is Dask not respecting the memory limits for LocalCluster?
I'm running the code pasted below on a machine with 16GB of RAM (purposely).
import dask.array as da
import dask.delayed
from sklearn.datasets import make_blobs
import numpy as np
from dask_ml.cluster import KMeans
from dask.distributed import Client

client = Client(n_workers=4, threads_per_worker=1, processes=False,
                memory_limit='2GB', scheduler_port=0,
                silence_logs=False, dashboard_address=8787)

n_centers = 12
n_features = 4

X_small, y_small = make_blobs(n_samples=1000, centers=n_centers,
                              n_features=n_features, random_state=0)

centers = np.zeros((n_centers, n_features))
for i in range(n_centers):
    centers[i] = X_small[y_small == i].mean(0)
print(centers)

n_samples_per_block = 450 * 650 * 900
n_blocks = 4

delayeds = [dask.delayed(make_blobs)(n_samples=n_samples_per_block,
                                     centers=centers,
                                     n_features=n_features,
                                     random_state=i)[0]
            for i in range(n_blocks)]
arrays = [da.from_delayed(obj, shape=(n_samples_per_block, n_features), dtype=X_small.dtype)
          for obj in delayeds]
X = da.concatenate(arrays)
print(X)

X = X.rechunk((1000, 4))

clf = KMeans(init_max_iter=3, oversampling_factor=10)
clf.fit(X)

client.close()
Considering I'm creating 4 workers with a 2 GB memory limit each (8 GB total), I would expect this algorithm not to exceed that machine's memory. Unfortunately, it is using more than 16 GB plus swap.
I really don't know what is wrong with that code, or if I misunderstood the concepts of Dask (especially because this code does not have any complexity in terms of data dependencies).
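Part of the issue may simply be the size of each block: every delayed `make_blobs` call must materialize its entire block inside a single task, and `memory_limit` cannot interrupt an allocation made inside one running task (and with `processes=False` there is no Nanny process to terminate an over-limit worker). A quick back-of-the-envelope check of the block sizes in the question, assuming `make_blobs` returns float64:

```python
n_samples_per_block = 450 * 650 * 900   # 263,250,000 rows per delayed block
n_features = 4
bytes_per_element = 8                   # float64

block_gib = n_samples_per_block * n_features * bytes_per_element / 2**30
print(f"one delayed block: {block_gib:.2f} GiB")    # ~7.85 GiB
print(f"four blocks:       {4 * block_gib:.2f} GiB")  # ~31.38 GiB
```

So each of the four blocks alone is far larger than the 2 GB per-worker limit, which is consistent with the machine going into swap.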
This is not a direct answer to the problem of dask not respecting the memory constraint (the short answer seems to be that this is not a binding constraint), however the code can be improved along these directions:

- use the make_blobs that was adapted by dask_ml: this reduces overhead due to construction of a dask array and related reshaping;
- use a context manager to create the client, which will handle the .close of the client, especially if there are errors in the code executed on the workers.

from dask.distributed import Client
from dask_ml.cluster import KMeans
from dask_ml.datasets import make_blobs
client_params = dict(
    n_workers=4,
    threads_per_worker=1,
    processes=False,
    memory_limit="2GB",
    scheduler_port=0,
    silence_logs=False,
    dashboard_address=8787,
)

n_centers = 12
n_features = 4
n_samples = 1000 * 100
chunks = (1000 * 50, 4)

X, _ = make_blobs(
    n_samples=n_samples,
    centers=n_centers,
    n_features=n_features,
    random_state=0,
    chunks=chunks,
)

clf = KMeans(init_max_iter=3, oversampling_factor=10, n_clusters=n_centers)

with Client(**client_params) as client:
    result = clf.fit(X)

print(result.cluster_centers_)
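A quick way to see why this version stays within memory: dask_ml's `make_blobs` generates the data lazily, chunk by chunk, so only small chunks are ever held in a worker at once. Plain arithmetic on the sizes used above (assuming float64 output):

```python
bytes_per_element = 8          # float64
rows_per_chunk = 1000 * 50     # chunks=(50_000, 4) in the snippet above
n_features = 4
n_samples = 1000 * 100

chunk_mib = rows_per_chunk * n_features * bytes_per_element / 2**20
n_chunks = n_samples // rows_per_chunk
print(f"{n_chunks} chunks of ~{chunk_mib:.2f} MiB each")  # 2 chunks of ~1.53 MiB each
```

Each chunk is about 1.5 MiB, comfortably below the 2 GB per-worker limit, in contrast to the multi-GiB blocks of the original snippet.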
This is meant as a complement to @SultanOrazbayev's answer, which was much faster than the original snippet and ran well within the allocated memory.
Using the Dask Dashboard's "Workers Memory" panel, I also saw the total process memory exceeding the 2GB memory limit, along with this warning:

distributed.worker - WARNING - Unmanaged memory use is high. This may indicate a memory leak or the memory may not be released to the OS

and Unmanaged memory: 3.57 GiB -- Worker memory limit: 1.86 GiB, with the unmanaged memory increasing as the computation continued. Usually in this case the recommendation is to manually trim the memory, however, as explained in a similar question on the Dask Discourse, this is not currently possible with KMeans (here is the open issue). Hopefully this adds some helpful context.
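For reference, the manual trim mentioned above normally means asking glibc to release free heap pages back to the OS via `malloc_trim`, as described in the Dask documentation on unmanaged memory. A minimal sketch (Linux-only, since it loads `libc.so.6`):

```python
import ctypes

def trim_memory() -> int:
    """Ask glibc to return free heap pages to the OS (Linux only)."""
    libc = ctypes.CDLL("libc.so.6")
    return libc.malloc_trim(0)

# On a running cluster this would be dispatched to every worker:
# client.run(trim_memory)
```

The point of the linked issue is that this workaround cannot currently be applied while a blocking `KMeans.fit` is in progress.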