Correct usage of gpus-per-task for allocating distinct GPUs via SLURM

Question

I am using the cons_tres SLURM plugin, which introduces, among other things, the --gpus-per-task option. If my understanding is correct, the following script should allocate two distinct GPUs on the same node:

#!/bin/bash
#SBATCH --ntasks=2
#SBATCH --tasks-per-node=2
#SBATCH --cpus-per-task=4
#SBATCH --gres=gpu:2
#SBATCH --gpus-per-task=1

srun --ntasks=2 --gres=gpu:1 nvidia-smi -L

However, it doesn't, as the output is

GPU 0: Tesla V100-SXM3-32GB (UUID: GPU-c55b3036-d54d-a885-7c6c-4238840c836e)
GPU 0: Tesla V100-SXM3-32GB (UUID: GPU-c55b3036-d54d-a885-7c6c-4238840c836e)

What gives?

Related: https://stackoverflow.com/a/55029430/10260561

Edit

Alternatively, the srun command could be

srun --ntasks=1 --gres=gpu:1 nvidia-smi -L &
srun --ntasks=1 --gres=gpu:1 nvidia-smi -L &
wait

i.e., run the two tasks in parallel, each on 1 GPU. This also doesn't work, and gives

GPU 0: Tesla V100-SXM3-32GB (UUID: GPU-c55b3036-d54d-a885-7c6c-4238840c836e)
srun: Job 627 step creation temporarily disabled, retrying
srun: Step created for job 627
GPU 0: Tesla V100-SXM3-32GB (UUID: GPU-c55b3036-d54d-a885-7c6c-4238840c836e)

Leaving out the extra parameters and calling srun nvidia-smi -L results in

GPU 0: Tesla V100-SXM3-32GB (UUID: GPU-c55b3036-d54d-a885-7c6c-4238840c836e)
GPU 1: Tesla V100-SXM3-32GB (UUID: GPU-ce697126-4112-a696-ff6b-1b072cdf03a2)
GPU 0: Tesla V100-SXM3-32GB (UUID: GPU-c55b3036-d54d-a885-7c6c-4238840c836e)
GPU 1: Tesla V100-SXM3-32GB (UUID: GPU-ce697126-4112-a696-ff6b-1b072cdf03a2)

i.e., are 4 tasks being run?

I need to run two tasks in parallel on distinct GPUs.
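As a quick way to compare the outputs above, the number of distinct GPU UUIDs in a listing can be counted. This is only a rough sketch: `count_distinct_gpus` is a hypothetical helper name, and it assumes the `nvidia-smi -L` line format shown in this question. It reports how many distinct devices appear across all tasks, not whether each task is restricted to its own device (in the last listing above every task lists both GPUs).

```shell
#!/bin/bash
# Hypothetical helper: read `nvidia-smi -L` style lines on stdin and
# print how many distinct GPU UUIDs they contain.
count_distinct_gpus() {
    grep -o 'UUID: GPU-[0-9a-f-]*' | sort -u | wc -l | tr -d ' '
}

# Example input copied from the first output in the question:
# both tasks report the same UUID.
printf '%s\n' \
    'GPU 0: Tesla V100-SXM3-32GB (UUID: GPU-c55b3036-d54d-a885-7c6c-4238840c836e)' \
    'GPU 0: Tesla V100-SXM3-32GB (UUID: GPU-c55b3036-d54d-a885-7c6c-4238840c836e)' \
    | count_distinct_gpus
```

Inside the job it could be used as `srun nvidia-smi -L | count_distinct_gpus`; a result of 1 for two tasks means they share a device.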

Answer 1

Taken together, the #SBATCH directives allocate two tasks and two GPUs on one node. So far so good. But then you tell srun to use both available tasks and only one GPU. This is why the two tasks share a GPU.

To solve it, you can leave out the extra parameters when calling srun. By default it will use all available tasks and GPUs.

Answer 2

This does what I want:

srun --gres=gpu:1 bash -c 'CUDA_VISIBLE_DEVICES=$SLURM_PROCID env' | grep CUDA_VISIBLE

CUDA_VISIBLE_DEVICES=1
CUDA_VISIBLE_DEVICES=0

but it doesn't make use of --gpus-per-task.
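The same idea can be written as a small wrapper script, shown here only as a sketch. `select_gpu` and `NUM_GPUS` are hypothetical names that are not part of SLURM, and the script uses `SLURM_LOCALID` (the node-local task ID) where the one-liner above uses `SLURM_PROCID`, so that the index would stay within one node's GPUs. It assumes each task can see all of the node's GPUs, which depends on how the cluster is configured.

```shell
#!/bin/bash
# Hypothetical helper: map a task rank to one GPU index.
# Usage: select_gpu RANK NUM_GPUS  -> prints RANK modulo NUM_GPUS
select_gpu() {
    echo $(( $1 % $2 ))
}

# SLURM sets SLURM_LOCALID inside a job step; default to 0 so the
# sketch also runs outside SLURM.
rank="${SLURM_LOCALID:-0}"
# Assumption: two GPUs per node, as in the question; override via NUM_GPUS.
num_gpus="${NUM_GPUS:-2}"

CUDA_VISIBLE_DEVICES="$(select_gpu "$rank" "$num_gpus")"
export CUDA_VISIBLE_DEVICES
echo "task ${rank} -> CUDA_VISIBLE_DEVICES=${CUDA_VISIBLE_DEVICES}"
```

A real wrapper would then run the actual program after the export, for example by ending with `exec "$@"` and being launched through srun with the program as its arguments.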

Answer 1
1 2021-02-08 07:04:16

Answer 2
0 accepted 2021-02-08 13:52:05