Resolving the SLURM error "sbatch: error: Batch job submission failed: Requested node configuration is not available"

Question

We have 4 GPU nodes with 2 36-core CPUs and 200 GB of RAM available at our local cluster. When I try to submit a job with the following configuration:

#SBATCH --nodes=1
#SBATCH --ntasks=40
#SBATCH --cpus-per-task=1
#SBATCH --mem-per-cpu=1500MB
#SBATCH --gres=gpu:4
#SBATCH --time=0-10:00:00

I get the following error:

sbatch: error: Batch job submission failed: Requested node configuration is not available

What might be the reason for this error? The nodes have exactly the kind of hardware that I need...

Answer 1

The CPUs are most likely 36-thread, not 36-core, and Slurm is probably configured to allocate cores rather than threads.

Check the output of scontrol show nodes to see what the nodes really offer.
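As a rough sketch of that check, the filter below keeps only the resource-related fields from the scontrol output. The function name node_summary is made up for illustration; the field names (CPUTot, Sockets, CoresPerSocket, ThreadsPerCore, RealMemory, Gres) are the ones scontrol show nodes normally prints, though the exact set can vary with the Slurm version and site configuration.

```shell
# Hypothetical helper: keep only the resource fields from
# `scontrol show nodes` output read on standard input.
node_summary() {
    grep -oE '(NodeName|CPUTot|Sockets|CoresPerSocket|ThreadsPerCore|RealMemory|Gres)=[^ ]+'
}

# Usage on a machine with the Slurm client tools installed:
#   scontrol show nodes | node_summary
```

If ThreadsPerCore is 2 and CPUTot counts threads, the number of cores a job can actually be given on one node is half of CPUTot, which would explain why a request that looks reasonable is rejected.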

Answer 2

You're requesting 40 tasks on nodes with 36 CPUs. The default SLURM configuration binds tasks to cores, so reducing the number of tasks to 36 or fewer may work. (Or increase the number of nodes to 2, if your application can handle that.)
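A possible adjusted header, assuming this answer's reading that each node offers 36 allocatable CPUs (the thread does not confirm this, nor that a single node has 4 GPUs), might look like the following. Only the task count differs from the question's script.

```shell
#!/bin/bash
# Sketch only: same request as in the question, with the task count
# lowered to 36 on the assumption that one node offers 36 CPUs.
#SBATCH --nodes=1
#SBATCH --ntasks=36
#SBATCH --cpus-per-task=1
#SBATCH --mem-per-cpu=1500MB
#SBATCH --gres=gpu:4
#SBATCH --time=0-10:00:00
```

The alternative mentioned above is to keep --ntasks=40 and set --nodes=2. Note that --gres is a per-node request, so --gres=gpu:4 would then ask for 4 GPUs on each of the two nodes.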

2 solutions

Solution 1
5, accepted, 2019-03-29 13:22:35

Solution 2
0, 2019-03-22 06:33:29
