On an NVIDIA host with 2 GPUs, how can two remote users each use one GPU via srun under SLURM?
I have an NVIDIA host with 2 GPUs, and two different remote users need to use a GPU on that host. When each of them runs a task with srun, managed by SLURM, one of them is allocated GPU resources immediately, but the other stays in a queue waiting for resources. But there are two GPUs. Why doesn't each user get a GPU?

I have already tried several alternatives in the parameters, but it seems that when using srun in interactive form, whoever manages to start their job holds the whole machine until the job finishes.
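For context, each user's interactive session would look something like the sketch below; the exact flags the users ran were not given in the question, so these values are illustrative:

```
# Each user asks for exactly one GPU and a bounded share of CPU/memory,
# leaving room on the node for the second job (flag values are placeholders).
srun --gres=gpu:1 --ntasks=1 --cpus-per-task=4 --mem=16G --pty bash
```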
Assuming Slurm is correctly configured to allow node sharing (the SelectType option) and to manage GPUs as generic resources (the GresTypes option), you could run scontrol show node and compare the AllocTRES and CfgTRES outputs.
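For reference, a minimal configuration sketch that lets two single-GPU jobs share one node; the node name gpuhost and the CPU/memory figures are placeholders, not from the original post:

```
# slurm.conf (excerpt)
SelectType=select/cons_tres          # allocate individual cores/GPUs, not whole nodes
SelectTypeParameters=CR_Core_Memory  # track cores and memory per job
GresTypes=gpu                        # manage GPUs as a generic resource
NodeName=gpuhost Gres=gpu:2 CPUs=16 RealMemory=64000

# gres.conf on the GPU node (excerpt)
NodeName=gpuhost Name=gpu File=/dev/nvidia[0-1]
```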
This would show what resources are available and help find out why job 2 is pending. Maybe job 1 used the --exclusive parameter? Maybe job 1 requested all the CPUs or all the memory? Maybe job 1 requested all the GPUs? Etc.
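A quick way to check is sketched below; the node name, the example output, and the job id are placeholders:

```
# Compare configured vs. allocated trackable resources on the node
scontrol show node gpuhost | grep -E 'CfgTRES|AllocTRES'
# e.g. CfgTRES=cpu=16,mem=64000M,gres/gpu=2
#      AllocTRES=cpu=16,mem=64000M,gres/gpu=1   <- job 1 took every CPU

# See why the second job is pending (%r prints the pending Reason)
squeue --states=PD -o '%i %u %r'
scontrol show job <jobid>
```

If AllocTRES already matches CfgTRES for CPUs or memory even though only one GPU is in use, the first job is hogging those resources and the second job will wait regardless of the free GPU.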