
Is it possible to select workers for specific tasks in Dask?

I have a process running on my Kubernetes cluster with Dask that consists of two map-reduce phases, and both map steps download potentially numerous large files to each worker. To avoid having two different machines process the same subset of files in the two different map steps, is it possible to deterministically choose which workers get which arguments across the two jobs? Conceptually, what I want might be something like:

workers: List = client.get_workers()
#                      ^^^^^^^^^^^
filenames: List[str] = get_filenames()  # input data to process

# map each file to a specific worker
file_to_worker = {filename: workers[hash(filename) % len(workers)] for filename in filenames}

# submit each file, specifying which worker should be assigned the task
futures = [client.submit(my_func, filename, worker=file_to_worker[filename]) for filename in filenames]
#                                           ^^^^^^ 

Something like this would let me direct different steps of the computation on the same files to the same nodes, eliminating any need to cache the files a second time.

Yes, you can submit tasks to specific workers by passing the workers= keyword to submit:

future = client.submit(func, arg, workers=[worker_address])  # worker address or name
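
A minimal sketch of the hashing scheme from the question, assuming the worker addresses are taken from client.scheduler_info() and that get_filenames, my_map_func and my_second_map_func are placeholders for your own code; both map phases reuse the same file-to-worker mapping, so each file only ever lands on one machine:

from dask.distributed import Client

client = Client('scheduler-address:8786')  # placeholder scheduler address

# addresses of the workers currently known to the scheduler
workers = list(client.scheduler_info()['workers'])

filenames = get_filenames()  # placeholder: your own input listing

# deterministic file -> worker mapping (note: Python's built-in hash is
# randomized per process, so use e.g. hashlib if the mapping must be stable
# across separate client runs)
file_to_worker = {fn: workers[hash(fn) % len(workers)] for fn in filenames}

# first map phase: pin each task to its chosen worker
phase1 = [client.submit(my_map_func, fn, workers=[file_to_worker[fn]])
          for fn in filenames]

# second map phase: same mapping, so the files are already on each machine
phase2 = [client.submit(my_second_map_func, fn, workers=[file_to_worker[fn]])
          for fn in filenames]

results = client.gather(phase2)

If you would rather let the scheduler fall back to another worker when the pinned one is unavailable, submit also accepts allow_other_workers=True.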

You can also attach abstract resources to workers and submit tasks against those resource definitions instead of specific hostnames:

data = [client.submit(load, fn) for fn in filenames]
processed = [client.submit(process, d, resources={'GPU': 1}) for d in data]
final = client.submit(aggregate, processed, resources={'MEMORY': 70e9})

For setup details, see the Dask resources documentation.
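
A hedged sketch of that worker-side setup (the scheduler address and resource names are placeholders): each worker advertises its resources when it starts, for example via the dask-worker --resources flag, and the client can then check what the scheduler sees:

# on each machine, start the worker with the resources it offers, e.g.:
#   dask-worker scheduler-address:8786 --resources "GPU=2"
#   dask-worker scheduler-address:8786 --resources "MEMORY=100e9"
# (on Kubernetes this typically goes in the worker container's command/args)

from dask.distributed import Client

client = Client('scheduler-address:8786')  # placeholder scheduler address

# inspect the resources each worker declared to the scheduler
for addr, info in client.scheduler_info()['workers'].items():
    print(addr, info.get('resources'))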
