
On an NVIDIA host with 2 GPUs, how can two remote users each use one GPU via the srun command under SLURM?

I have an NVIDIA host with 2 GPUs, and two different remote users need to use one GPU each on that host. When each of them runs tasks with srun, managed by Slurm, one of them is allocated GPU resources immediately, while the other stays in the queue waiting for resources. But there are two GPUs, so why doesn't each user get one? I have already tried several alternatives in the parameters, but it seems that when srun is used interactively, whoever manages to start a job holds the whole machine until that job finishes.
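Each user starts an interactive session with something along the lines of the following (the CPU and memory values here are only illustrative, not the exact parameters tried):

    srun --gres=gpu:1 --cpus-per-task=4 --mem=16G --pty bash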

Assuming Slurm is correctly configured to allow node sharing (the SelectType option) and to manage GPUs as generic resources (the GresTypes option), you can run scontrol show node and compare the AllocTRES and CfgTRES outputs.
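For reference, node sharing with GPU scheduling is typically set up with something like the following in slurm.conf and gres.conf (the node name, CPU count, memory size, and device paths below are placeholders, not values from the question):

    # slurm.conf
    SelectType=select/cons_tres
    SelectTypeParameters=CR_Core_Memory
    GresTypes=gpu
    NodeName=gpunode01 Gres=gpu:2 CPUs=32 RealMemory=128000 State=UNKNOWN

    # gres.conf (on the GPU node)
    Name=gpu File=/dev/nvidia0
    Name=gpu File=/dev/nvidia1

With select/cons_tres and per-GPU GRES entries, two jobs that each request --gres=gpu:1 (and less than half of the CPUs and memory) can run on the node at the same time.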

This will show which resources are allocated and help you find out why job 2 is pending. Maybe job 1 used the --exclusive parameter? Maybe job 1 requested all the CPUs or all the memory? Maybe it requested all the GPUs?
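A quick way to check, assuming the node is called gpunode01 and the jobs have ids 1234 and 1235 (all three names are placeholders):

    # Compare configured vs. currently allocated trackable resources on the node
    scontrol show node gpunode01 | grep -E 'CfgTRES|AllocTRES'

    # Inspect the running and the pending job; look at the OverSubscribe,
    # NumCPUs, memory and TRES/GPU fields of each
    scontrol show job 1234
    scontrol show job 1235

If AllocTRES already equals CfgTRES for the CPUs or memory while only one GPU is in use, the first job is consuming the non-GPU resources and that is what keeps the second job pending.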
