
How to get the ID of the GPU allocated to a SLURM job on a multi-GPU node?

When I submit a SLURM job with the option --gres=gpu:1 to a node with two GPUs, how can I get the ID of the GPU which is allocated to the job? Is there an environment variable for this purpose? The GPUs I'm using are all NVIDIA GPUs. Thanks.

You can get the GPU id with the environment variable CUDA_VISIBLE_DEVICES. This variable is a comma-separated list of the GPU ids assigned to the job.
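
For example, a minimal sbatch script that prints the variable from inside the job (a sketch; the --gres value here is just an illustration):

#!/bin/bash
#SBATCH --gres=gpu:1
# Print the comma-separated list of GPU ids Slurm exposed to this job
echo "CUDA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES"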

Slurm stores this information in the environment variable SLURM_JOB_GPUS.
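
Inside the script run by sbatch you could print it directly (a minimal sketch):

# SLURM_JOB_GPUS lists the physical GPU index/indices assigned to the job, e.g. "1"
echo "SLURM_JOB_GPUS=$SLURM_JOB_GPUS"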

One way to keep track of such information is to log all SLURM-related variables when running a job, for example (following Kaldi's slurm.pl, which is a great script for wrapping Slurm jobs) by including the following command in the script run by sbatch:

# Log every SLURM-related environment variable, one per line, prefixed with '#'
set | grep SLURM | while read line; do echo "# $line"; done

You can check the environment variables SLURM_STEP_GPUS or SLURM_JOB_GPUS for a given node:

# Use the step-level variable if set; otherwise fall back to the job-level one
echo ${SLURM_STEP_GPUS:-$SLURM_JOB_GPUS}

Note that CUDA_VISIBLE_DEVICES may not correspond to the real (physical) GPU id (see @isarandi's comment).
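
A sketch of how one might observe that caveat from inside a job (assuming nvidia-smi is available on the node): CUDA_VISIBLE_DEVICES is typically renumbered to start at 0, while SLURM_JOB_GPUS keeps the physical index, so the two can differ on a multi-GPU node:

echo "CUDA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES"   # may be remapped, e.g. "0"
echo "SLURM_JOB_GPUS=$SLURM_JOB_GPUS"               # physical index, e.g. "1"
# Device UUIDs are stable identifiers, useful for cross-checking
nvidia-smi --query-gpu=index,uuid --format=csv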

Also, note that this should work for non-NVIDIA GPUs as well.
