
How to increase scheduler memory in GKE for Dask

I have deployed a Kubernetes cluster on GCP that runs a combination of Prefect and Dask. The jobs run fine in a normal scenario, but the setup fails to scale to twice the data volume. So far, I have narrowed it down to the scheduler being shut down due to high memory usage (screenshot: Dask scheduler memory). As soon as memory usage reaches 2 GB, the jobs fail with a "no heartbeat detected" error.

There is a separate Python build file where we set worker memory and CPU. We use the dask-gateway package to get the Gateway options and set the worker memory:

from dask_gateway import Gateway

gateway = Gateway()
options = gateway.cluster_options()
options.worker_memory = 32
options.worker_cores = 10
cluster = gateway.new_cluster(options)
cluster.adapt(minimum=4, maximum=20)

I am unable to figure out where and how I can increase the memory allocation for the dask-scheduler.

Specs:

  • Cluster version: 1.19.14-gke.1900
  • Machine type: n1-highmem-64
  • Autoscaling set to 6 - 1000 nodes per zone
  • All nodes are allocated 63.77 CPU and 423.26 GB memory

To explain why flow heartbeats exist in the first place: Prefect uses heartbeats to check whether your flow is still running. Without heartbeats, a flow that lost communication and died in a remote execution environment, such as a Kubernetes job, would be shown as Running in the UI forever. Usually "no heartbeat detected" happens as a result of running out of memory, or when your flow executes long-running jobs.

One solution you could try is to set the following environment variable on your run configuration. This changes the heartbeat behavior from processes to threads and can help resolve the issue:

from prefect.run_configs import UniversalRun

flow.run_config = UniversalRun(env={"PREFECT__CLOUD__HEARTBEAT_MODE": "thread"})
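
Note that UniversalRun is agent-agnostic, so this setting also applies when the flow runs as a Kubernetes job on GKE; if you already use a more specific run config such as KubernetesRun, you can pass the same env dictionary there.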

As you mentioned, the best solution would be to increase the memory of your Dask workers. If you use a long-running cluster, you can set it when starting each worker:

dask-worker tcp://scheduler:port --memory-limit="4 GiB"
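
If you go that route, you can point Prefect's DaskExecutor at the already-running cluster by its scheduler address instead of having Prefect create one. A minimal sketch, assuming a placeholder scheduler address:

from prefect import Flow
from prefect.executors import DaskExecutor

flow = Flow("test-flow")
# Connect to an existing Dask cluster; memory limits are whatever the
# scheduler and worker processes were started with (e.g. --memory-limit above)
flow.executor = DaskExecutor(address="tcp://scheduler:8786")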

And if you pass a cluster class to your Dask executor, e.g. coiled.Cluster, you can set both:

  • scheduler_memory - defaults to 4 GiB
  • worker_memory - defaults to 8 GiB

Here is how you could set that on your flow:

import coiled
from prefect import Flow
from prefect.executors import DaskExecutor

flow = Flow("test-flow")
executor = DaskExecutor(
    cluster_class=coiled.Cluster,
    cluster_kwargs={
        "software": "user/software_env_name",
        "shutdown_on_close": True,
        "name": "prefect-cluster",
        "scheduler_memory": "4 GiB",
        "worker_memory": "8 GiB",
    },
)
flow.executor = executor
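
With this configuration, running or registering the flow provisions a Coiled cluster with a 4 GiB scheduler and 8 GiB workers for the duration of the flow run, and shutdown_on_close=True tears the cluster down once the run finishes.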
