Slurm, srun and parallelism
I have seen the following two very similar schemes used when submitting jobs with multiple steps to Slurm.
On the one hand, you can do:
#SBATCH -N1 -c1 -n5 # 5 tasks in total on 1 node, 1 cpu per task
for j in {1..4}; do
    srun --exclusive -n1 script $j &
done
srun --exclusive -n1 script 5
wait
On the other hand, you can do:
#SBATCH -N1 -c1 -n5 # 5 tasks in total on 1 node, 1 cpu per task
for j in {1..5}; do
    srun --exclusive -n1 script $j &
done
wait
Because the job should have only 5 CPUs allocated to it, I don't really understand how the second one can work correctly: after four job steps have been started with srun, there is no way the scheduler can allocate a fifth job 'in the background' and then return to the original script... where would the original script run? (I admit my knowledge of these things is pretty limited, though.)
However, I have personally tested both ways and they both seem to work exactly the same. The second script is a bit simpler in my opinion, and when dealing with somewhat larger input scripts this can be an advantage, but I'm worried that I don't understand 100% of what is going on here. Is there a preferred way to do this? What is the difference? What is Bash/Slurm doing behind the scenes?
They both work the same in principle, though the second one is clearer (and correct; see below). Each invocation of srun will run script on a separate CPU (probably; if the steps finish very quickly, they could end up running on a subset of the sbatch-allocated CPUs).
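The launch-and-wait pattern itself is plain bash and can be tried without a cluster. Below is a minimal sketch in which a hypothetical function run_step stands in for srun --exclusive -n1 script $j (an assumption for illustration; it only sleeps and writes one line). It is meant to show that & hands control back to the script right away, so the loop normally finishes launching all five steps while they are still running, and that wait then blocks until every background step has exited.

```shell
#!/bin/bash
# run_step is a hypothetical stand-in for: srun --exclusive -n1 script "$j"
run_step() {
    sleep 1
    echo "step $1 done"
}

outdir=$(mktemp -d)
pids=""
launched=0
for j in 1 2 3 4 5; do                        # same as {1..5} in bash
    run_step "$j" > "$outdir/step_$j.out" &   # & returns immediately
    pids="$pids $!"
    launched=$((launched + 1))
done

# The loop normally completes well before any step does:
# count how many background steps are still alive at this point.
running=0
for p in $pids; do
    kill -0 "$p" 2>/dev/null && running=$((running + 1))
done
echo "launched=$launched, still running right after the loop=$running"

wait    # block until every background step has exited

# Count the steps that produced output.
finished=0
for f in "$outdir"/step_*.out; do
    [ -s "$f" ] && finished=$((finished + 1))
done
echo "finished=$finished"
rm -rf "$outdir"
```

In the real batch script the shell process that runs the loop is the batch script itself; each srun it starts is just a child process of that shell, which is why the script can keep going after launching a step in the background.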
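I think the first example doesn't need wait, since the last command isn't run in the background. (That only holds if the foreground step is the last one to finish; keeping wait is harmless and makes sure the batch script does not exit while background steps are still running.)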
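Whether the final wait matters depends on which step finishes last. A small bash sketch, with sleep and true standing in for the job steps (an assumption for illustration; no Slurm is involved), shows that a foreground command returning says nothing about the background ones:

```shell
#!/bin/bash
sleep 2 &          # stand-in for a slow background step
bg_pid=$!

true               # stand-in for a fast foreground step (the last srun)

# The foreground command is done, but the background one is still alive.
if kill -0 "$bg_pid" 2>/dev/null; then
    alive_before_wait=yes
else
    alive_before_wait=no
fi

wait               # without this, the script would simply carry on and exit

# After wait, the background process has exited and been reaped.
if kill -0 "$bg_pid" 2>/dev/null; then
    alive_after_wait=yes
else
    alive_after_wait=no
fi
echo "before wait: $alive_before_wait, after wait: $alive_after_wait"
```

In a batch job, the script exiting ends the job, so a step still running at that point would be cut short. That is why keeping wait is the safer choice in both variants.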
By the way, the first example is easy to get wrong: $j is only set by the for-loop and keeps its last value (4) afterwards, so if the run outside the loop is written as script $j, the last run inside the loop and the run outside the loop both invoke script 4. Passing the literal 5 to the final srun avoids this.
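The point about $j can be checked without Slurm. In bash, a for-loop variable is an ordinary shell variable, so after the loop it still holds the last value it was assigned. A minimal sketch (last_j is a name introduced here only for illustration):

```shell
#!/bin/bash
for j in 1 2 3 4; do    # same as {1..4} in bash
    :                   # loop body omitted
done

# j was not reset when the loop ended.
last_j=$j
echo "after the loop: j=$last_j"
```

So a trailing srun --exclusive -n1 script $j placed after a loop over 1..4 would repeat the fourth step rather than run a fifth one.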