GPU allocation in Slurm: --gres vs --gpus-per-task, and mpirun vs srun

There are two ways to allocate GPUs in Slurm: either the general --gres=gpu:N parameter, or specific parameters like --gpus-per-task=N. There are also two ways to launch MPI tasks in a batch script: either using srun, or using the usual mpirun (when OpenMPI is compiled with Slurm support). I found some surprising differences in behaviour between these methods.

I'm submitting a batch job with sbatch where the basic script is the following:

#!/bin/bash

#SBATCH --job-name=sim_1        # job name (default is the name of this file)
#SBATCH --output=log.%x.job_%j  # file name for stdout/stderr (%x will be replaced with the job name, %j with the jobid)
#SBATCH --time=1:00:00          # maximum wall time allocated for the job (D-H:MM:SS)
#SBATCH --partition=gpXY        # put the job into the gpu partition
#SBATCH --exclusive             # request exclusive allocation of resources
#SBATCH --mem=20G               # RAM per node
#SBATCH --threads-per-core=1    # do not use hyperthreads (i.e. CPUs = physical cores below)
#SBATCH --cpus-per-task=4       # number of CPUs per process

## nodes allocation
#SBATCH --nodes=2               # number of nodes
#SBATCH --ntasks-per-node=2     # MPI processes per node

## GPU allocation - variant A
#SBATCH --gres=gpu:2            # number of GPUs per node (gres=gpu:N)

## GPU allocation - variant B
## #SBATCH --gpus-per-task=1       # number of GPUs per process
## #SBATCH --gpu-bind=single:1     # bind each process to its own GPU (single:<tasks_per_gpu>)

# start the job in the directory it was submitted from
cd "$SLURM_SUBMIT_DIR"

# program execution - variant 1
mpirun ./sim

# program execution - variant 2
#srun ./sim

The #SBATCH options in the first block are quite obvious and uninteresting. Next, the behaviour I'll describe is observable when the job runs on at least 2 nodes. I'm running 2 tasks per node since we have 2 GPUs per node. Finally, there are two variants of GPU allocation (A and B) and two variants of program execution (1 and 2). Hence, 4 variants in total: A1, A2, B1, B2.
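
The per-rank lines shown under each variant are presumably printed by the sim binary itself; the same GPU visibility can also be checked without it, e.g. with a trivial srun command inside the allocation (a minimal sketch using only the standard Slurm task variables SLURM_PROCID and SLURM_LOCALID, not part of the original job script):

# hedged diagnostic: print what each task sees, independent of ./sim
srun bash -c 'echo "task $SLURM_PROCID on $(hostname): local id $SLURM_LOCALID, CUDA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES"'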

Variant A1 (--gres=gpu:2, mpirun)

Variant A2 (--gres=gpu:2, srun)

In both variants A1 and A2, the job executes correctly with optimal performance, and we get the following output in the log:

Rank 0: rank on node is 0, using GPU id 0 of 2, CUDA_VISIBLE_DEVICES=0,1
Rank 1: rank on node is 1, using GPU id 1 of 2, CUDA_VISIBLE_DEVICES=0,1
Rank 2: rank on node is 0, using GPU id 0 of 2, CUDA_VISIBLE_DEVICES=0,1
Rank 3: rank on node is 1, using GPU id 1 of 2, CUDA_VISIBLE_DEVICES=0,1

Variant B1 (--gpus-per-task=1, mpirun)

The job does not execute correctly; the GPUs are mapped incorrectly because CUDA_VISIBLE_DEVICES=0 on the second node:

Rank 0: rank on node is 0, using GPU id 0 of 2, CUDA_VISIBLE_DEVICES=0,1
Rank 1: rank on node is 1, using GPU id 1 of 2, CUDA_VISIBLE_DEVICES=0,1
Rank 2: rank on node is 0, using GPU id 0 of 1, CUDA_VISIBLE_DEVICES=0
Rank 3: rank on node is 1, using GPU id 0 of 1, CUDA_VISIBLE_DEVICES=0

Note that this variant behaves the same with and without --gpu-bind=single:1.

Variant B2 (--gpus-per-task=1, --gpu-bind=single:1, srun)

GPUs are mapped correctly (now each process sees only one GPU because of --gpu-bind=single:1):

Rank 0: rank on node is 0, using GPU id 0 of 1, CUDA_VISIBLE_DEVICES=0
Rank 1: rank on node is 1, using GPU id 0 of 1, CUDA_VISIBLE_DEVICES=1
Rank 2: rank on node is 0, using GPU id 0 of 1, CUDA_VISIBLE_DEVICES=0
Rank 3: rank on node is 1, using GPU id 0 of 1, CUDA_VISIBLE_DEVICES=1

However, an MPI error appears when the ranks start to communicate (a similar message is repeated once for each rank):

--------------------------------------------------------------------------
The call to cuIpcOpenMemHandle failed. This is an unrecoverable error
and will cause the program to abort.
  Hostname:                         gp11
  cuIpcOpenMemHandle return value:  217
  address:                          0x7f40ee000000
Check the cuda.h file for what the return value means. A possible cause
for this is not enough free device memory.  Try to reduce the device
memory footprint of your application.
--------------------------------------------------------------------------

Although it says "This is an unrecoverable error", the execution seems to proceed just fine, except that the log is littered with messages like this (apparently one message per MPI communication call):

[gp11:122211] Failed to register remote memory, rc=-1
[gp11:122212] Failed to register remote memory, rc=-1
[gp12:62725] Failed to register remote memory, rc=-1
[gp12:62724] Failed to register remote memory, rc=-1

Clearly this is an OpenMPI error message. I found an old thread about this error, which suggested using --mca btl_smcuda_use_cuda_ipc 0 to disable CUDA IPC. However, since srun was used in this case to launch the program, I have no idea how to pass such parameters to OpenMPI.
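
(If OpenMPI's usual convention of reading MCA parameters from OMPI_MCA_* environment variables applies here, exporting the variable before srun might be one way to try it; this is an untested sketch, not something I have verified:)

# untested sketch: the env-var form of --mca btl_smcuda_use_cuda_ipc 0
export OMPI_MCA_btl_smcuda_use_cuda_ipc=0
srun ./sim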

Note that in this variant --gpu-bind=single:1 affects only the visible GPUs (CUDA_VISIBLE_DEVICES). But even without this option, each task is still able to select the right GPU, and the errors still appear.


Any idea what is going on and how to address the errors in variants B1 and B2? Ideally we would like to use --gpus-per-task, which is more flexible than --gres=gpu:... (it's one less parameter to change when we change --ntasks-per-node). Using mpirun vs srun does not matter to us.

We have Slurm 20.11.5.1, OpenMPI 4.0.5 (built with --with-cuda and --with-slurm) and CUDA 11.2.2. The operating system is Arch Linux. The network is 10G Ethernet (no InfiniBand or OmniPath). Let me know if I should include more info.

I'm having a related issue. Running with

#SBATCH --nodes=1
#SBATCH --ntasks=4
#SBATCH --gres=gpu:1

will result in the processes sharing a single GPU:

"PROCID=2: GPU 0: NVIDIA A100-SXM4-40GB (UUID: GPU-715daa1d-db6f-9e69-ab48-190158bd5360)"
"PROCID=1: GPU 0: NVIDIA A100-SXM4-40GB (UUID: GPU-715daa1d-db6f-9e69-ab48-190158bd5360)"
"PROCID=0: GPU 0: NVIDIA A100-SXM4-40GB (UUID: GPU-715daa1d-db6f-9e69-ab48-190158bd5360)"
"PROCID=3: GPU 0: NVIDIA A100-SXM4-40GB (UUID: GPU-715daa1d-db6f-9e69-ab48-190158bd5360)"

This I suppose is correct.
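
(For context: the exact command that produced these listings is not shown here; output in this format could be obtained with a per-task nvidia-smi -L call, roughly like the following hedged reconstruction:)

# hedged reconstruction: list the GPUs visible to each task (not necessarily the command actually used)
srun bash -c 'echo "PROCID=$SLURM_PROCID: $(nvidia-smi -L)"'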

Running with

#SBATCH --nodes=1
#SBATCH --ntasks=4
#SBATCH --gpus-per-task=1

will result in only one of the processes receiving a GPU:

"PROCID=2: No devices found."
"PROCID=3: No devices found."
"PROCID=0: No devices found."
"PROCID=1: GPU 0: NVIDIA A100-SXM4-40GB (UUID: GPU-02348a17-a825-300c-0336-48e33d0dadb2)"

Note the different IDs in consecutive runs:

"PROCID=2: No devices found."
"PROCID=1: No devices found."
"PROCID=3: No devices found."
"PROCID=0: GPU 0: NVIDIA A100-SXM4-40GB (UUID: GPU-715daa1d-db6f-9e69-ab48-190158bd5360)"

Running with

#SBATCH --nodes=1
#SBATCH --ntasks=4
#SBATCH --gres=gpu:4

will result in each process having access to all 4 GPUs:

"PROCID=3: GPU 0: NVIDIA A100-SXM4-40GB (UUID: GPU-715daa1d-db6f-9e69-ab48-190158bd5360) GPU 1: NVIDIA A100-SXM4-40GB (UUID: GPU-02348a17-a825-300c-0336-48e33d0dadb2) GPU 2: NVIDIA A100-SXM4-40GB (UUID: GPU-fbd9a227-e473-b993-215f-8f39b3574fd0) GPU 3: NVIDIA A100-SXM4-40GB (UUID: GPU-7843f55f-a15b-1d4c-229c-39b5c439bd5e)"
"PROCID=1: GPU 0: NVIDIA A100-SXM4-40GB (UUID: GPU-715daa1d-db6f-9e69-ab48-190158bd5360) GPU 1: NVIDIA A100-SXM4-40GB (UUID: GPU-02348a17-a825-300c-0336-48e33d0dadb2) GPU 2: NVIDIA A100-SXM4-40GB (UUID: GPU-fbd9a227-e473-b993-215f-8f39b3574fd0) GPU 3: NVIDIA A100-SXM4-40GB (UUID: GPU-7843f55f-a15b-1d4c-229c-39b5c439bd5e)"
"PROCID=2: GPU 0: NVIDIA A100-SXM4-40GB (UUID: GPU-715daa1d-db6f-9e69-ab48-190158bd5360) GPU 1: NVIDIA A100-SXM4-40GB (UUID: GPU-02348a17-a825-300c-0336-48e33d0dadb2) GPU 2: NVIDIA A100-SXM4-40GB (UUID: GPU-fbd9a227-e473-b993-215f-8f39b3574fd0) GPU 3: NVIDIA A100-SXM4-40GB (UUID: GPU-7843f55f-a15b-1d4c-229c-39b5c439bd5e)"
"PROCID=0: GPU 0: NVIDIA A100-SXM4-40GB (UUID: GPU-715daa1d-db6f-9e69-ab48-190158bd5360) GPU 1: NVIDIA A100-SXM4-40GB (UUID: GPU-02348a17-a825-300c-0336-48e33d0dadb2) GPU 2: NVIDIA A100-SXM4-40GB (UUID: GPU-fbd9a227-e473-b993-215f-8f39b3574fd0) GPU 3: NVIDIA A100-SXM4-40GB (UUID: GPU-7843f55f-a15b-1d4c-229c-39b5c439bd5e)"

Running with

#SBATCH --ntasks=4
#SBATCH --gres=gpu:4
#SBATCH --gpu-bind=single:1

will again result in only one of the processes receiving a GPU:

"PROCID=1: No devices found."
"PROCID=0: No devices found."
"PROCID=3: No devices found."
"PROCID=2: GPU 0: NVIDIA A100-SXM4-40GB (UUID: GPU-fbd9a227-e473-b993-215f-8f39b3574fd0)"
