
gpucompute* is down* in slurm cluster

My gpucompute node is in the down state and I cannot submit jobs to the GPU node. After following every solution I could find on the web, I still cannot bring my down GPU node back. Before this problem appeared, I had an Nvidia driver configuration error and the GPUs were not detected by 'nvidia-smi'; after fixing that by running 'NVIDIA-Linux-x86_64-410.79.run --no-drm', I ran into the current error, which is due to the node being in the down state. Any help is greatly appreciated!

command: sbatch md1.s
sbatch: error: Batch job submission failed: Requested node configuration is not available

command:  sinfo
PARTITION   AVAIL  TIMELIMIT  NODES  STATE NODELIST
gpucompute*    up   infinite      1  down* fwb-lab-tesla1

command:  sinfo -R
REASON               USER      TIMESTAMP           NODELIST
Not responding       slurm     2020-09-25T13:13:19 fwb-lab-tesla1

command: sinfo -Nl
Fri Sep 25 16:35:25 2020
NODELIST        NODES   PARTITION       STATE CPUS    S:C:T MEMORY TMP_DISK WEIGHT AVAIL_FE REASON              
fwb-lab-tesla1      1 gpucompute*       down*   32   32:1:1  64000        0      1   (null) Not responding
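
A further check that can help here (a sketch, using the node name from the sinfo output above) is to dump the full node record, which shows the State and Reason fields together:

command: scontrol show node fwb-lab-tesla1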


command: vim /etc/slurm/slurm.conf
# slurm.conf file generated by configurator easy.html.
# Put this file on all nodes of your cluster.
# See the slurm.conf man page for more information.
#
ControlMachine=FWB-Lab-Tesla
#ControlAddr=137.72.38.102
#
MailProg=/bin/mail
MpiDefault=none
#MpiParams=ports=#-#
ProctrackType=proctrack/cgroup
ReturnToService=1
SlurmctldPidFile=/var/run/slurmctld.pid
#SlurmctldPort=6817
SlurmdPidFile=/var/run/slurmd.pid
#SlurmdPort=6818
SlurmdSpoolDir=/var/spool/slurmd
#SlurmUser=slurm
SlurmdUser=root
StateSaveLocation=/var/spool/slurm/StateSave
SwitchType=switch/none
TaskPlugin=task/cgroup
#
#
# TIMERS
#KillWait=30
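
Because this partition serves GPUs, slurmd on the compute node also reads a gres.conf describing the GPU device files; a minimal sketch, assuming the four TITAN V cards are exposed as /dev/nvidia0 through /dev/nvidia3 and that slurm.conf declares GresTypes=gpu with Gres=gpu:4 on the node line (these paths and counts are assumptions, not taken from the config above):

# /etc/slurm/gres.conf on the GPU node (sketch; device paths assumed)
Name=gpu File=/dev/nvidia[0-3]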

command: ls /etc/init.d
functions  livesys  livesys-late  netconsole  network  README
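
Note that the listing above contains no Slurm entries, so slurmd and slurmctld are presumably managed as systemd units rather than SysV init scripts; assuming a systemd-based distribution, they can be located and inspected with:

systemctl list-unit-files | grep -i slurm
systemctl status slurmd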

command: nvidia-smi
Fri Sep 25 16:35:01 2020    

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.79       Driver Version: 410.79       CUDA Version: 10.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  TITAN V             Off  | 00000000:02:00.0 Off |                  N/A |
| 24%   32C    P8    N/A /  N/A |      0MiB / 12036MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  TITAN V             Off  | 00000000:03:00.0 Off |                  N/A |
| 23%   35C    P8    N/A /  N/A |      0MiB / 12036MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  TITAN V             Off  | 00000000:83:00.0 Off |                  N/A |
| 30%   44C    P8    N/A /  N/A |      0MiB / 12036MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  TITAN V             Off  | 00000000:84:00.0 Off |                  N/A |
| 31%   42C    P8    N/A /  N/A |      0MiB / 12036MiB |      6%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

The problem you mention is probably preventing the slurmd daemon from starting on gpucompute. You should be able to confirm that by running systemctl status slurmd, or the equivalent command for your Linux distribution.

The slurmd logs probably contain a line similar to

slurmd[1234]: fatal: can't stat gres.conf file /dev/nvidia0: No such file or directory
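
If slurmd runs under systemd (an assumption about this setup), the same messages can usually be read back with journalctl, and the missing device file from the error above can be checked directly:

journalctl -u slurmd -b --no-pager
ls -l /dev/nvidia*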

Try restarting it with

systemctl start slurmd

once you have made sure that nvidia-smi responds correctly.

My problem was solved by the following instructions. Keep in mind that whenever you reboot the system, you need to enter the commands again after the restart. Thanks to Joan Bryan for resolving this!


slurmd -Dcvvv
reboot
ps -ef | grep slurm
kill xxxx   (xxxx is the process ID shown in the output of the previous ps -ef command)
nvidia-smi
systemctl start slurmctld
systemctl start slurmd
scontrol update nodename=fwb-lab-tesla1 state=idle
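
A quick way to confirm the recovery (using the partition and job script names from the question; the idle state is the expected result, not observed output) is:

sinfo -p gpucompute
sbatch md1.s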

Now you can run jobs on the GPU nodes!
Cheers
