
Slurm Requested node configuration is not available

Hello everyone, I'm trying to set up a new HPC cluster. I created an account, added users, and I'm using a partition, but whenever I run a job it fails telling me the requested node configuration is not available. I checked my slurm.conf but it seems fine to me, so I need some help. The error is: Batch job submission failed: Requested node configuration is not available

   #
# See the slurm.conf man page for more information.
#

SlurmUser=slurm
#SlurmdUser=root
SlurmctldPort=6817
SlurmdPort=6818
AuthType=auth/munge
#JobCredentialPrivateKey=
#JobCredentialPublicCertificate=
SlurmdSpoolDir=/cm/local/apps/slurm/var/spool
SwitchType=switch/none
MpiDefault=none
SlurmctldPidFile=/var/run/slurmctld.pid
SlurmdPidFile=/var/run/slurmd.pid
#ProctrackType=proctrack/pgid
ProctrackType=proctrack/cgroup
#PluginDir=
#FirstJobId=
ReturnToService=2
#MaxJobCount=
#PlugStackConfig=
#PropagatePrioProcess=
#PropagateResourceLimits=
#PropagateResourceLimitsExcept=
#SrunProlog=
#SrunEpilog=
#TaskProlog=
#TaskEpilog=
TaskPlugin=task/cgroup
#TrackWCKey=no
#TreeWidth=50
#TmpFs=
#UsePAM=
#
# TIMERS
SlurmctldTimeout=300
SlurmdTimeout=300
InactiveLimit=0
MinJobAge=300
KillWait=30
Waittime=0
#
# SCHEDULING
#SchedulerAuth=
#SchedulerPort=
#SchedulerRootFilter=
#PriorityType=priority/multifactor
#PriorityDecayHalfLife=14-0
#PriorityUsageResetPeriod=14-0
#PriorityWeightFairshare=100000
#PriorityWeightAge=1000
#PriorityWeightPartition=10000
#PriorityWeightJobSize=1000
#PriorityMaxAge=1-0
#
# LOGGING
SlurmctldDebug=3
SlurmctldLogFile=/var/log/slurmctld
SlurmdDebug=3
SlurmdLogFile=/var/log/slurmd

#JobCompType=jobcomp/filetxt
#JobCompLoc=/cm/local/apps/slurm/var/spool/job_comp.log

#
# ACCOUNTING
JobAcctGatherType=jobacct_gather/linux
#JobAcctGatherType=jobacct_gather/cgroup
#JobAcctGatherFrequency=30
AccountingStorageType=accounting_storage/slurmdbd
AccountingStorageUser=slurm
# AccountingStorageLoc=slurm_acct_db
# AccountingStoragePass=SLURMDBD_USERPASS

# This section of this file was automatically generated by cmd. Do not edit manually!
# BEGIN AUTOGENERATED SECTION -- DO NOT REMOVE
# Server nodes
SlurmctldHost=omics-master
AccountingStorageHost=master
# Nodes
NodeName=omics[01-05] Procs=48 Feature=local
# Partitions
PartitionName=defq Default=YES MinNodes=1 DefaultTime=UNLIMITED MaxTime=UNLIMITED AllowGroups=ALL PriorityJobFactor=1 PriorityTier=1 OverSubscribe=NO PreemptMode=OFF AllowAccounts=ALL AllowQos=ALL Nodes=omics[01-05]
ClusterName=omics
# Scheduler
SchedulerType=sched/backfill
# Statesave
StateSaveLocation=/cm/shared/apps/slurm/var/cm/statesave/omics
PrologFlags=Alloc
# Generic resources types
GresTypes=gpu
# Epilog/Prolog section
Prolog=/cm/local/apps/cmd/scripts/prolog
Epilog=/cm/local/apps/cmd/scripts/epilog
# Power saving section (disabled)
# END AUTOGENERATED SECTION   -- DO NOT REMOVE

and this is my sinfo:

PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
defq*        up   infinite      5   idle omics[01-05]

and this is my test script:

   #!/bin/bash
#SBATCH --nodes=2                       # Number of nodes
#SBATCH --ntasks-per-node=4
#SBATCH --ntasks-per-socket=2
#SBATCH --output=std.out
#SBATCH --error=std.err
#SBATCH --mem-per-cpu=1gb
echo "hello from:"
hostname; pwd; date;
sleep 10
echo "going to sleep during 10 seconds"
echo "wake up, exiting 

" "

and thanks in advance

In the node definition, you do not specify RealMemory, so Slurm assumes the default of 1 MB per node. Hence the request of 1 GB per CPU cannot be fulfilled: with --ntasks-per-node=4 and --mem-per-cpu=1gb, the job needs at least 4 GB of memory on each node, while Slurm believes each node has only 1 MB.
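You can check what Slurm currently believes about a node's memory with scontrol (omics01 is one of the nodes from the configuration above; the exact output fields vary slightly between Slurm versions, the rest of the line is elided here):

$ scontrol show node omics01 | grep RealMemory
   RealMemory=1 AllocMem=0 FreeMem=N/A ...

A RealMemory of 1 (MB) is what makes the --mem-per-cpu=1gb request impossible to satisfy.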

You should run slurmd -C on the compute node; that will give you the line to insert in the slurm.conf file so that Slurm correctly knows the hardware resources it can allocate.

$ slurmd -C | head -1
NodeName=node002 CPUs=16 Boards=1 SocketsPerBoard=2 CoresPerSocket=8 ThreadsPerCore=1 RealMemory=128547
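
Based on that output, the node definition in this slurm.conf would need at least a RealMemory value, along the lines of the sketch below (the memory figure is only a placeholder, use whatever slurmd -C actually reports on the omics nodes):

NodeName=omics[01-05] Procs=48 RealMemory=192000 Feature=local

After changing the node definition, restart slurmctld and the slurmd daemons (scontrol reconfigure may be enough, depending on the Slurm version) so the new hardware description is picked up.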
