
gpucompute* is down* in slurm cluster

My gpucompute nodes are in the down state and I can't send jobs to the GPU nodes. I couldn't bring the 'down' GPU nodes back after trying all the solutions I found on the net. Before this problem, I had an error with the NVIDIA driver configuration: the GPUs were not detected by 'nvidia-smi'. After fixing that by running 'NVIDIA-Linux-x86_64-410.79.run --no-drm', I ran into this new error, which is caused by the down state of the nodes. Any help is very much appreciated!

command: sbatch md1.s
sbatch: error: Batch job submission failed: Requested node configuration is not available

command:  sinfo
PARTITION   AVAIL  TIMELIMIT  NODES  STATE NODELIST
gpucompute*    up   infinite      1  down* fwb-lab-tesla1

command:  sinfo -R
REASON               USER      TIMESTAMP           NODELIST
Not responding       slurm     2020-09-25T13:13:19 fwb-lab-tesla1

 command: sinfo -Nl
Fri Sep 25 16:35:25 2020
NODELIST        NODES   PARTITION       STATE CPUS    S:C:T MEMORY TMP_DISK WEIGHT AVAIL_FE REASON              
fwb-lab-tesla1      1 gpucompute*       down*   32   32:1:1  64000        0      1   (null) Not responding
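
For completeness, the full node record, including the reason the controller marked the node down, can also be inspected with scontrol (output not captured here):

command: scontrol show node fwb-lab-tesla1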


command: vim /etc/slurm/slurm.conf
# slurm.conf file generated by configurator easy.html.
# Put this file on all nodes of your cluster.
# See the slurm.conf man page for more information.
#
ControlMachine=FWB-Lab-Tesla
#ControlAddr=137.72.38.102
#
MailProg=/bin/mail
MpiDefault=none
#MpiParams=ports=#-#
ProctrackType=proctrack/cgroup
ReturnToService=1
SlurmctldPidFile=/var/run/slurmctld.pid
#SlurmctldPort=6817
SlurmdPidFile=/var/run/slurmd.pid
#SlurmdPort=6818
SlurmdSpoolDir=/var/spool/slurmd
#SlurmUser=slurm
SlurmdUser=root
StateSaveLocation=/var/spool/slurm/StateSave
SwitchType=switch/none
TaskPlugin=task/cgroup
#
#
# TIMERS
#KillWait=30

command: ls /etc/init.d
functions  livesys  livesys-late  netconsole  network  README
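
There is no slurm init script under /etc/init.d, which suggests the daemons run as systemd units (an assumption based on this listing rather than anything stated elsewhere); they can be located with:

command: systemctl list-unit-files | grep -i slurm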

command: nvidia-smi
Fri Sep 25 16:35:01 2020    

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.79       Driver Version: 410.79       CUDA Version: 10.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  TITAN V             Off  | 00000000:02:00.0 Off |                  N/A |
| 24%   32C    P8    N/A /  N/A |      0MiB / 12036MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  TITAN V             Off  | 00000000:03:00.0 Off |                  N/A |
| 23%   35C    P8    N/A /  N/A |      0MiB / 12036MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  TITAN V             Off  | 00000000:83:00.0 Off |                  N/A |
| 30%   44C    P8    N/A /  N/A |      0MiB / 12036MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  TITAN V             Off  | 00000000:84:00.0 Off |                  N/A |
| 31%   42C    P8    N/A /  N/A |      0MiB / 12036MiB |      6%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
The problem you mentioned probably prevented the slurmd daemon on gpucompute from starting. You should be able to confirm that by running systemctl status slurmd or the equivalent command for your Linux distribution.

The slurmd logs probably contain a line similar to

slurmd[1234]: fatal: can't stat gres.conf file /dev/nvidia0: No such file or directory
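
That message would mean slurmd cannot stat the GPU device files referenced in gres.conf, because the NVIDIA driver had not created /dev/nvidia0 and friends when the daemon started. For reference, a gres.conf for a node with four GPUs typically looks like the following hypothetical sketch (the actual file on fwb-lab-tesla1 is not shown in the question; slurm.conf would also need GresTypes=gpu plus Gres=gpu:4 on the node line):

# /etc/slurm/gres.conf (hypothetical example for a 4-GPU node)
NodeName=fwb-lab-tesla1 Name=gpu File=/dev/nvidia[0-3]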

Try restarting it with

systemctl start slurmd

once you have made sure nvidia-smi responds correctly.
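
A possible check sequence before that restart (assuming a systemd-based node, which the rest of the thread suggests) is:

ls -l /dev/nvidia*        (the device files referenced by gres.conf should exist once the driver is loaded)
systemctl status slurmd
journalctl -u slurmd -e   (recent slurmd log entries)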

My problem was solved with the instructions below. Remember that you need to re-enter these commands every time you restart the system (see the note after the command list for a way to automate part of this). Thanks to Joan Bryan for resolving this!


slurmd -Dcvvv
reboot
ps -ef | grep slurm
kill xxxx          (xxxx is the process ID shown in the output of the previous ps -ef command)
nvidia-smi
systemctl start slurmctld
systemctl start slurmd
scontrol update nodename=fwb-lab-tesla1 state=idle
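
If you would rather not retype all of this after every reboot, a partial automation (assuming systemd, as the commands above suggest) is to enable the daemons on the machines where they run; note that this alone does not guarantee the NVIDIA device files exist before slurmd starts, so the nvidia-smi step may still be needed:

systemctl enable slurmctld
systemctl enable slurmd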

Now you can run jobs on the GPU nodes!
Cheers
