
How to determine the maximal number of dask.distributed.cluster workers

TL;DR: how can I access the maximum number of workers available to a dask.distributed.client, including workers that aren't started yet, in a way that works with both adaptive and non-adaptive scaling strategies?

I develop a library for adaptive parallel execution of functions, which plans ahead which points to execute. For that we need to know how many workers can be used in parallel, and we currently rely on the client.ncores() function.
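
For reference, a minimal sketch of what we do today (the local cluster size and addresses below are purely illustrative):

    from dask.distributed import Client, LocalCluster

    # Purely illustrative local setup.
    cluster = LocalCluster(n_workers=2, threads_per_worker=2)
    client = Client(cluster)

    # ncores() maps each *currently running* worker address to its thread count,
    # e.g. {'tcp://127.0.0.1:34567': 2, 'tcp://127.0.0.1:34568': 2}
    cores_per_worker = client.ncores()
    max_parallel = sum(cores_per_worker.values())  # what we treat as the parallelism limit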

This approach is, however, problematic for several reasons:

  • The workers need to be running in the first place, because ncores only reports what is happening right now.
  • If the cluster has adaptive scaling, we're interested in the maximum number of workers we can get, not the current count.

Therefore I would like to know if there's a programmatic way to inspect the client and determine how many workers a dask cluster can acquire.

A general solution to what you're looking for doesn't and can't (to my knowledge) exist. There's no hard limit on the number of dask workers for many system types. Anything you find here will likely have to be tailored to individual cluster variants, if it works at all.

This would have to be a property of the Cluster, not the client. A LocalCluster can spin up as many workers/processes/threads as desired - at some point this stops being efficient, but the number isn't bounded. See LocalCluster and its parent class SpecCluster for implementation details.
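
For illustration, a hedged sketch of the local case (the numbers are arbitrary, and reading minimum/maximum off the returned Adaptive object is an assumption that may vary between distributed versions): the only upper bound that exists is the one you configure yourself.

    from dask.distributed import LocalCluster

    # Nothing stops you from requesting a large number of local workers;
    # the figures here are arbitrary.
    cluster = LocalCluster(n_workers=4, threads_per_worker=1)

    # With adaptive scaling, the only "maximum" is the bound you set yourself.
    # Recent distributed versions keep it on the returned Adaptive object.
    adaptive = cluster.adapt(minimum=1, maximum=16)
    print(adaptive.minimum, adaptive.maximum)  # 1 16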

Other flavors such as dask_jobqueue have a totally different model, whereby nodes are allocated by the HPC workload manager and are theoretically unbounded but are in practice limited by other workloads on the cluster and by your account & the HPC's configuration; similarly kube_cluster will scale up unless limited by the helm chart, quotas, available resources, or your credit card bouncing. If using dask Gateway, the administrator can specify maximum core, memory, and worker counts for the entire cluster, and these limits can be accessed through the cluster config. However, if these are unbounded, there's no hard limit to these values that is known to dask.
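
To make the jobqueue case concrete, here's a hedged sketch with dask_jobqueue (the queue name, resource sizes, and job counts are made up; the minimum_jobs/maximum_jobs keywords exist in recent dask_jobqueue releases, but verify against your installed version):

    from dask_jobqueue import SLURMCluster

    # Each job asks the HPC scheduler for one worker of this shape
    # (queue and resources are hypothetical).
    cluster = SLURMCluster(queue="regular", cores=8, memory="32GB")

    # The only "maximum" dask knows about is what you request:
    cluster.scale(jobs=10)                            # fixed size: ask for 10 jobs
    # cluster.adapt(minimum_jobs=1, maximum_jobs=50)  # or an adaptive upper bound

    # Whether SLURM ever grants those jobs depends on site limits, your account,
    # and everyone else's workload; none of that is visible to dask.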

If you need to implement this for a specific cluster variant, you could narrow your question and maybe get more specific help; but I think you may be limited here to using the APIs of each cluster to get the info you want (e.g. google/aws/azure etc.'s kubernetes API, maybe pyslurm, etc.), and there's no guarantee that there's an actual upper bound for many of these options.
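
For example, on Kubernetes you could fall back to the cluster's own API to estimate total capacity. A sketch using the official kubernetes Python client (this reports node-allocatable CPU, which is only a rough proxy for how many dask worker cores you could ever run):

    from kubernetes import client, config

    # Load credentials the same way kubectl does (assumes a configured kubeconfig).
    config.load_kube_config()
    v1 = client.CoreV1Api()

    # Allocatable CPU values are Kubernetes quantity strings such as "4" or "3800m".
    def cpus(quantity: str) -> float:
        return float(quantity[:-1]) / 1000 if quantity.endswith("m") else float(quantity)

    # Sum allocatable CPUs across nodes as a rough upper bound on worker cores.
    total_cpus = sum(cpus(node.status.allocatable["cpu"]) for node in v1.list_node().items)
    print(total_cpus)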
