On an NVIDIA host with 2 GPUs, how can two remote users each use one GPU via srun under SLURM?
I have an NVIDIA host with 2 GPUs, and two different remote users need to use a GPU on that host. When each of them runs a task with srun, managed by SLURM, one of them is allocated GPU resources immediately, but the other stays in a queue waiting for resources. But there are two GPUs. Why doesn't each user get a GPU?

I have already tried several alternatives in the parameters, but it seems that when using srun in interactive form, whoever manages to start their job holds the whole machine until the job finishes.
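For context, each user's interactive session would look something like the sketch below; the exact flags the users ran were not given in the question, so these values are illustrative:

```
# Each user asks for exactly one GPU and a bounded share of CPU/memory,
# leaving room on the node for the second job (flag values are placeholders).
srun --gres=gpu:1 --ntasks=1 --cpus-per-task=4 --mem=16G --pty bash
```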
Assuming Slurm is correctly configured to allow node sharing (the SelectType option) and to manage GPUs as generic resources (the GresTypes option), you could run scontrol show node and compare the AllocTRES and CfgTRES outputs.
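For reference, a minimal configuration sketch that lets two single-GPU jobs share one node; the node name gpuhost and the CPU/memory figures are placeholders, not from the original post:

```
# slurm.conf (excerpt)
SelectType=select/cons_tres          # allocate individual cores/GPUs, not whole nodes
SelectTypeParameters=CR_Core_Memory  # track cores and memory per job
GresTypes=gpu                        # manage GPUs as a generic resource
NodeName=gpuhost Gres=gpu:2 CPUs=16 RealMemory=64000

# gres.conf on the GPU node (excerpt)
NodeName=gpuhost Name=gpu File=/dev/nvidia[0-1]
```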
This would show what resources are available and help find out why job 2 is pending. Maybe job 1 used the --exclusive parameter? Maybe job 1 requested all the CPUs or all the memory? Maybe job 1 requested all the GPUs? Etc.
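A quick way to check is sketched below; the node name, the example output, and the job id are placeholders:

```
# Compare configured vs. allocated trackable resources on the node
scontrol show node gpuhost | grep -E 'CfgTRES|AllocTRES'
# e.g. CfgTRES=cpu=16,mem=64000M,gres/gpu=2
#      AllocTRES=cpu=16,mem=64000M,gres/gpu=1   <- job 1 took every CPU

# See why the second job is pending (%r prints the pending Reason)
squeue --states=PD -o '%i %u %r'
scontrol show job <jobid>
```

If AllocTRES already matches CfgTRES for CPUs or memory even though only one GPU is in use, the first job is hogging those resources and the second job will wait regardless of the free GPU.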