
How to increase scheduler memory in GKE for DASK

I have deployed a Kubernetes cluster on GCP with a combination of Prefect and Dask. The jobs run fine in a normal scenario, but they fail to scale to twice the data volume. So far, I have narrowed it down to the scheduler getting shut off due to high memory usage: as soon as the scheduler's memory usage touches 2 GB, the jobs fail with a "no heartbeat detected" error.

There is a separate build Python file where we set the worker memory and CPU. It uses the dask-gateway package to fetch the Gateway options and set up the worker memory:

from dask_gateway import Gateway

gateway = Gateway()
options = gateway.cluster_options()
options.worker_memory = 32
options.worker_cores = 10
cluster = gateway.new_cluster(options)
cluster.adapt(minimum=4, maximum=20)

I am unable to figure out where and how I can increase the memory allocation for the dask-scheduler.

Specs:
Cluster Version: 1.19.14-gke.1900
Machine type - n1-highmem-64
Autoscaling set to 6 - 1000 nodes per zone
All nodes are allocated 63.77 vCPU and 423.26 GB of memory

To explain why flow heartbeats exist in the first place: Prefect uses heartbeats to check whether your flow is still running. If Prefect didn't have heartbeats, flows that lost communication and died in a remote execution environment, such as a Kubernetes job, would be shown as Running in the UI permanently. Usually "no heartbeat detected" happens as a result of running out of memory, or when your flow executes long-running jobs.
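Conceptually, a heartbeat is just a timestamp that the flow process refreshes periodically and that the server checks against a timeout. A minimal, self-contained sketch of that idea (this is an illustration, not Prefect's actual implementation):

```python
import threading
import time

class HeartbeatMonitor:
    """Tracks the last heartbeat and reports liveness against a timeout."""

    def __init__(self, timeout_seconds):
        self.timeout = timeout_seconds
        self._last_beat = time.monotonic()
        self._lock = threading.Lock()

    def beat(self):
        # Called periodically by the running flow to signal "still alive".
        with self._lock:
            self._last_beat = time.monotonic()

    def is_alive(self):
        # The server side: no beat within the timeout means the flow is
        # presumed dead ("no heartbeat detected").
        with self._lock:
            return time.monotonic() - self._last_beat < self.timeout
```

A flow that is OOM-killed simply stops calling beat(), so the monitor eventually reports it as dead, which is exactly the failure mode described above.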

One solution that you could try is to set the following environment variable in your run configuration. This changes the heartbeat behavior from processes to threads and can help solve the issue:

from prefect.run_configs import UniversalRun

flow.run_config = UniversalRun(env={"PREFECT__CLOUD__HEARTBEAT_MODE": "thread"})

As you mentioned, the best solution would be to increase the memory of your Dask workers. If you use a long-running cluster, you can set it this way:

dask-worker tcp://scheduler:port --memory-limit="4 GiB"

And if you pass a cluster class to your Dask executor, e.g. coiled.Cluster, you can set both:

  • scheduler_memory - defaults to 4 GiB
  • worker_memory - defaults to 8 GiB

Here is how you could set that on your flow:

import coiled
from prefect import Flow
from prefect.executors import DaskExecutor

flow = Flow("test-flow")
executor = DaskExecutor(
    cluster_class=coiled.Cluster,
    cluster_kwargs={
        "software": "user/software_env_name",
        "shutdown_on_close": True,
        "name": "prefect-cluster",
        "scheduler_memory": "4 GiB",
        "worker_memory": "8 GiB",
    },
)
flow.executor = executor
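Finally, since the question deploys the cluster through dask-gateway on GKE rather than Coiled, the scheduler pod's memory there is usually controlled by the gateway deployment itself, not by client-side options. If the gateway was installed with the dask-gateway Helm chart, the scheduler resources are typically set in the chart values, along the lines of the sketch below (the exact keys depend on your chart version, so verify them against your deployment's values.yaml):

```yaml
# Sketch of a values.yaml fragment for the dask-gateway Helm chart;
# verify the keys against your chart version before applying.
gateway:
  backend:
    scheduler:
      cores:
        request: 2
        limit: 4
      memory:
        request: "4 G"
        limit: "8 G"
```

After changing the values, redeploy with helm upgrade so that newly created schedulers pick up the larger limits.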
