
Dask: How to use delayed functions with worker resources?

I want to make a Dask Delayed flow which includes CPU and GPU tasks. GPU tasks can only run on GPU workers, and a GPU worker only has one GPU and can only handle one GPU task at a time.

Unfortunately, I see no way to specify worker resources in the Delayed API.

Here is the common code:

from time import sleep

from dask import delayed
from dask.distributed import Client

# Each local worker advertises one 'GPU' resource
client = Client(resources={'GPU': 1})

@delayed
def fcpu(x, y):
    # CPU-only task
    sleep(1)
    return x + y

@delayed
def fgpu(x, y):
    # Task that requires one GPU
    sleep(1)
    return x + y

Here is the flow written in pure Delayed. This code will not behave properly because it doesn't know about the GPU resource.

# STEP ONE: two parallel CPU tasks
a = fcpu(1, 1)
b = fcpu(10, 10)

# STEP TWO: two GPU tasks
c = fgpu(a, b)  # Requires 1 GPU
d = fgpu(a, b)  # Requires 1 GPU

# STEP THREE: final CPU task
e = fcpu(c, d)

%time e.compute()  # 3 seconds

This is the best solution I could come up with. It combines Delayed syntax with Client.compute() futures. It seems to behave correctly, but it is very ugly.

# STEP ONE: two parallel CPU tasks
a = fcpu(1, 1)
b = fcpu(10, 10)
a_future, b_future = client.compute([a, b])  # We DON'T want a resource limit

# STEP TWO: two GPU tasks - only enough resources to run one at a time
c = fgpu(a_future, b_future)
d = fgpu(a_future, b_future)
c_future, d_future = client.compute([c, d], resources={'GPU': 1})

# STEP THREE: final CPU task
e = fcpu(c_future, d_future)
res = e.compute()

Is there a better way to do this?

Maybe an approach similar to what is described in https://jobqueue.dask.org/en/latest/examples.html would work. It covers the case of processing on a machine with a single GPU or a machine with an SSD.

def step_1_w_single_GPU(data):
    return "Step 1 done for: %s" % data


def step_2_w_local_IO(data):
    return "Step 2 done for: %s" % data


stage_1 = [delayed(step_1_w_single_GPU)(i) for i in range(10)]
stage_2 = [delayed(step_2_w_local_IO)(s2) for s2 in stage_1]

result_stage_2 = client.compute(stage_2,
                                resources={tuple(stage_1): {'GPU': 1},
                                           tuple(stage_2): {'ssdGB': 100}})
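
For reference, here is a sketch of the same per-collection resources mapping applied to the flow from the question. It reuses the fcpu/fgpu definitions above and assumes a Dask version where client.compute accepts this style of resources argument, as in the jobqueue example:

# Sketch only: mirror the jobqueue-style resources mapping for the original flow
a = fcpu(1, 1)
b = fcpu(10, 10)

c = fgpu(a, b)
d = fgpu(a, b)

e = fcpu(c, d)

# Only the two GPU tasks are restricted to workers advertising a 'GPU' resource
e_future = client.compute(e, resources={tuple([c, d]): {'GPU': 1}})
result = e_future.result()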

This is possible with annotations; see the example in the docs:

import dask
import dask.dataframe as dd

x = dd.read_csv(...)
with dask.annotate(resources={'GPU': 1}):
    y = x.map_partitions(func1)
z = y.map_partitions(func2)

z.compute(optimize_graph=False)

As noted in the docs, such annotations can be lost during graph optimization, hence the kwarg optimize_graph=False.
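
A minimal sketch of the same idea applied to the Delayed flow from the question (reusing the fcpu/fgpu definitions and the client above; only the GPU tasks are annotated, everything else runs unrestricted):

import dask

# STEP ONE: two parallel CPU tasks (no resource restriction)
a = fcpu(1, 1)
b = fcpu(10, 10)

# STEP TWO: annotate the GPU tasks so each one requires a 'GPU' resource
with dask.annotate(resources={'GPU': 1}):
    c = fgpu(a, b)
    d = fgpu(a, b)

# STEP THREE: final CPU task
e = fcpu(c, d)

# optimize_graph=False keeps the resource annotations from being dropped
result = e.compute(optimize_graph=False)

With one 'GPU' resource per worker, the two fgpu tasks should run one at a time on each GPU worker while the CPU tasks remain unconstrained.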
