My gpucompute node is in the down state and I can't submit jobs to the GPU nodes. I have not been able to bring the down GPU node back after trying all the solutions I found online. Before this problem, I had an Nvidia driver configuration error that prevented 'nvidia-smi' from detecting the GPUs. After fixing that by running 'NVIDIA-Linux-x86_64-410.79.run --no-drm', I ran into the error below, which is caused by the down state of the node. I would very much appreciate any help!
command: sbatch md1.s
sbatch: error: Batch job submission failed: Requested node configuration is not available
command: sinfo
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
gpucompute* up infinite 1 down* fwb-lab-tesla1
command: sinfo -R
REASON USER TIMESTAMP NODELIST
Not responding slurm 2020-09-25T13:13:19 fwb-lab-tesla1
command: sinfo -Nl
Fri Sep 25 16:35:25 2020
NODELIST NODES PARTITION STATE CPUS S:C:T MEMORY TMP_DISK WEIGHT AVAIL_FE REASON
fwb-lab-tesla1 1 gpucompute* down* 32 32:1:1 64000 0 1 (null)Not responding
command: vim /etc/slurm/slurm.conf
# slurm.conf file generated by configurator easy.html.
# Put this file on all nodes of your cluster.
# See the slurm.conf man page for more information.
#
ControlMachine=FWB-Lab-Tesla
#ControlAddr=137.72.38.102
#
MailProg=/bin/mail
MpiDefault=none
#MpiParams=ports=#-#
ProctrackType=proctrack/cgroup
ReturnToService=1
SlurmctldPidFile=/var/run/slurmctld.pid
#SlurmctldPort=6817
SlurmdPidFile=/var/run/slurmd.pid
#SlurmdPort=6818
SlurmdSpoolDir=/var/spool/slurmd
#SlurmUser=slurm
SlurmdUser=root
StateSaveLocation=/var/spool/slurm/StateSave
SwitchType=switch/none
TaskPlugin=task/cgroup
#
#
# TIMERS
#KillWait=30
command: ls /etc/init.d
functions livesys livesys-late netconsole network README
command: nvidia-smi
Fri Sep 25 16:35:01 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.79 Driver Version: 410.79 CUDA Version: 10.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 TITAN V Off | 00000000:02:00.0 Off | N/A |
| 24% 32C P8 N/A / N/A | 0MiB / 12036MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 1 TITAN V Off | 00000000:03:00.0 Off | N/A |
| 23% 35C P8 N/A / N/A | 0MiB / 12036MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 2 TITAN V Off | 00000000:83:00.0 Off | N/A |
| 30% 44C P8 N/A / N/A | 0MiB / 12036MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 3 TITAN V Off | 00000000:84:00.0 Off | N/A |
| 31% 42C P8 N/A / N/A | 0MiB / 12036MiB | 6% Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| No running processes found |
The problem you mentioned probably prevented the slurmd daemon on the gpucompute node from starting. You should be able to confirm that by running systemctl status slurmd, or the equivalent command for your Linux distribution.
The slurmd logs probably contain a line similar to
slurmd[1234]: fatal: can't stat gres.conf file /dev/nvidia0: No such file or directory
Once you have made sure that nvidia-smi responds correctly, try restarting the daemon with
systemctl start slurmd
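The order of checks in this answer can be sketched as a small shell helper. This is only an illustration, assuming a systemd host and that gres.conf points at /dev/nvidia0 as in the sample log line; the function names gpu_ready and start_slurmd_if_gpu_ready are hypothetical and not part of Slurm.

```shell
#!/bin/sh
# Sketch: start slurmd only after the GPU driver is answering.
# Assumptions: systemd is in use, and /dev/nvidia0 is the device
# file that gres.conf refers to (taken from the sample log line).

gpu_ready() {
    # Succeeds only if the device file exists and nvidia-smi exits cleanly.
    dev="${1:-/dev/nvidia0}"
    [ -e "$dev" ] && nvidia-smi >/dev/null 2>&1
}

start_slurmd_if_gpu_ready() {
    # Optional argument: path of the GPU device file to check.
    if gpu_ready "$1"; then
        systemctl start slurmd
    else
        echo "GPU not ready; not starting slurmd" >&2
        return 1
    fi
}
```

After sourcing the file as root, calling start_slurmd_if_gpu_ready with no argument checks /dev/nvidia0. If it refuses to start the daemon, fix the driver first and then look at systemctl status slurmd.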
My problem was solved with the instructions below. Note that you need to enter these commands again every time you restart the system. Thanks to Joan Bryan for resolving this!
slurmd -Dcvvv
reboot
ps -ef | grep slurm
kill xxxx (xxxx is the process ID shown in the output of the previous ps -ef command)
nvidia-smi
systemctl start slurmctld
systemctl start slurmd
scontrol update nodename=fwb-lab-tesla1 state=idle
Now you can run jobs on the GPU nodes!
Cheers
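Since these commands have to be repeated after every restart, the post-reboot part could be wrapped in one function. This is a rough sketch, not a tested recipe: bring_node_back is a hypothetical name, the reboot and kill steps are deliberately left out, and it assumes the controller and the compute daemon run on the same machine, as they do here. The Slurm documentation also describes state=resume for returning a DOWN node to service, which may be worth trying in place of state=idle.

```shell
#!/bin/sh
# Sketch: repeat the post-reboot steps from the answer above in order.
# Assumption: slurmctld and slurmd both run on this host, and the
# script is run as root.

bring_node_back() {
    node="$1"
    if [ -z "$node" ]; then
        echo "usage: bring_node_back <nodename>" >&2
        return 2
    fi
    # Stop early if the driver is not answering.
    nvidia-smi >/dev/null || return 1
    # Start the controller first, then the compute daemon.
    systemctl start slurmctld || return 1
    systemctl start slurmd || return 1
    # Mark the node as usable again, as in the answer.
    scontrol update nodename="$node" state=idle
}
```

For the node in this question, the call would be bring_node_back fwb-lab-tesla1, followed by sinfo to confirm that the state is no longer down.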