
How to get the ID of the GPU allocated to a SLURM job on a multi-GPU node?

When I submit a SLURM job with the option --gres=gpu:1 to a node with two GPUs, how can I get the ID of the GPU which is allocated to the job? Is there an environment variable for this purpose? The GPUs I'm using are all NVIDIA GPUs. Thanks.

You can get the GPU id with the environment variable CUDA_VISIBLE_DEVICES. This variable is a comma-separated list of the GPU ids assigned to the job.
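
For example, a minimal sbatch script that prints the variable from inside the job (a sketch; the --gres value here is just an illustration):

#!/bin/bash
#SBATCH --gres=gpu:1
# Print the comma-separated list of GPU ids Slurm exposed to this job
echo "CUDA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES"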

Slurm stores this information in the environment variable SLURM_JOB_GPUS.
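
Inside the script run by sbatch you could print it directly (a minimal sketch):

# SLURM_JOB_GPUS lists the physical GPU index/indices assigned to the job, e.g. "1"
echo "SLURM_JOB_GPUS=$SLURM_JOB_GPUS"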

One way to keep track of such information is to log all SLURM-related variables when running a job, for example (following Kaldi's slurm.pl, which is a great script for wrapping Slurm jobs) by including the following command in the script run by sbatch:

# Log every SLURM-related environment variable, one per line, prefixed with '#'
set | grep SLURM | while read line; do echo "# $line"; done

You can check the environment variables SLURM_STEP_GPUS or SLURM_JOB_GPUS for a given node:

# Use the step-level variable if set; otherwise fall back to the job-level one
echo ${SLURM_STEP_GPUS:-$SLURM_JOB_GPUS}

Note that CUDA_VISIBLE_DEVICES may not correspond to the real (physical) GPU id (see @isarandi's comment).
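
A sketch of how one might observe that caveat from inside a job (assuming nvidia-smi is available on the node): CUDA_VISIBLE_DEVICES is typically renumbered to start at 0, while SLURM_JOB_GPUS keeps the physical index, so the two can differ on a multi-GPU node:

echo "CUDA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES"   # may be remapped, e.g. "0"
echo "SLURM_JOB_GPUS=$SLURM_JOB_GPUS"               # physical index, e.g. "1"
# Device UUIDs are stable identifiers, useful for cross-checking
nvidia-smi --query-gpu=index,uuid --format=csv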

Also, note that this should work for non-NVIDIA GPUs as well.
