Having trouble getting slurm sbatch job arrays to assign jobs to cores before assigning to additional nodes

I have a number of jobs that each require a single core to run. The cluster I use has 5 nodes, each with 96 cores. When I use slurm to submit the jobs, they are always spread across multiple nodes, and if there are more than 5 of them (i.e., more than the number of nodes) they tend to run sequentially rather than concurrently on each node. The same behaviour is observed when I restrict the nodes: sequential, not concurrent. The scheduler is configured with "cons_tres" and I have tried many different suggestions and combinations of the script below.

I did manage to get the desired behaviour using $SLURM_PROCID accessed through a wrapper script, but I need to access data throughout the run for each model and have found $SLURM_ARRAY_TASK_ID very convenient for this. I have tried submitting with srun within the sbatch script, but nothing seems to work. The last iteration, with the optional srun inclusion, is shown below. I am pretty new (~1 week) to writing scheduling scripts, so please forgive any incorrect/inaccurate descriptions. I really appreciate any solutions, but am also looking to understand more fully where I am going wrong. Thanks!

#!/bin/tcsh
## SLURM TEST

#SBATCH --job-name=seatest
#SBATCH --nodes=1-1              # minimum and maximum of one node per array task
#SBATCH --ntasks=5
#SBATCH --ntasks-per-node=5      # all five tasks on the same node
#SBATCH --array=1-5              # five array tasks, indices 1 through 5
#SBATCH --output=slurm-%A_%03a.out

hostname

set CASE_NUM=`printf %03d $SLURM_ARRAY_TASK_ID`   # zero-pad the array task ID to three digits

# srun is optional here; the behaviour was the same with and without it
srun program-name seatest.$CASE_NUM.in

These jobs were sent to 1 core on each of the five nodes, not to 5 cores of 1 node.

Memory-based scheduling was enabled on the cluster, which required the memory (--mem) for each job to be specified.
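For reference, a minimal sketch of the array script with the memory request added. The 1G value is a placeholder and the per-task --ntasks=1 request is an assumption (each array task runs one single-core program), not something stated in the original post:

#!/bin/tcsh
## SLURM TEST - with per-task memory declared

#SBATCH --job-name=seatest
#SBATCH --ntasks=1                  # each array task is a single one-core job
#SBATCH --mem=1G                    # required once memory-based scheduling is enabled; 1G is a placeholder
#SBATCH --array=1-5                 # five array tasks, indices 1 through 5
#SBATCH --output=slurm-%A_%03a.out

hostname

set CASE_NUM=`printf %03d $SLURM_ARRAY_TASK_ID`   # zero-pad the array task ID to three digits

program-name seatest.$CASE_NUM.in

With the memory footprint declared, the scheduler no longer has to assume each task needs a whole node's memory, so the five one-core tasks should be packed onto free cores of a single node rather than one task per node.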
