
Slurm, srun and parallelism

I have seen the following two very similar schemes used when submitting jobs with multiple steps to Slurm.

On the one hand you can do:

#SBATCH -N1 -c1 -n5 # 5 tasks in total on 1 node, 1 cpu per task
for j in {1..4}; do
    srun --exclusive -n1 script $j &
done
srun --exclusive -n1 script 5
wait

On the other hand you can do:

#SBATCH -N1 -c1 -n5 # 5 tasks in total on 1 node, 1 cpu per task
for j in {1..5}; do
    srun --exclusive -n1 script $j &
done
wait

Because the job should have only 5 CPUs allocated to it, I don't really understand how the second one can work correctly: after four job steps have been started with srun, there is no way the scheduler can allocate a fifth job step 'in the background' and then return to the original script... where would the original script run? (I admit my knowledge of these things is pretty limited, though.)

However, I have personally tested both ways and they both seem to work exactly the same. The second script is a bit simpler in my opinion, and when dealing with somewhat larger input scripts this can be an advantage, but I'm worried that I don't fully understand what is going on here. Is there a preferred way to do this? What is the difference? What is Bash/Slurm doing behind the scenes?

They both work the same in principle, though the second one is clearer (and less error-prone - see below). Each invocation of srun will run script on a separate CPU (probably; if script finishes very quickly, the steps could end up running on a subset of the CPUs allocated by sbatch). The batch script itself is not one of the 5 tasks: its shell runs on the first node of the allocation, and CPUs are handed to job steps only when srun launches them, so five steps can run at once while the script sits in wait.

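The first example still needs wait, though: the last srun runs in the foreground, but any of the four background steps may still be running when it returns, and Slurm ends the job (killing any steps still running) once the batch script exits.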
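What wait does in this pattern can be checked without a cluster. The following is a minimal sketch in plain Bash, where sleep and touch stand in for srun and script (an assumption made purely so it runs anywhere); the marker files and variable names are illustrative, not part of the original question.

```shell
#!/usr/bin/env bash
# Plain Bash stand-in: "sleep; touch" replaces "srun ... script $j"
# so this runs without Slurm. All names here are hypothetical.
workdir=$(mktemp -d)

for j in 1 2 3 4; do
    ( sleep 1; touch "$workdir/done_$j" ) &   # slow background "step"
done
touch "$workdir/done_5"                        # fast foreground "step"

# Count the steps that have finished at the point where the
# foreground command returns, i.e. before wait is called.
finished=("$workdir"/done_*)
before_wait=${#finished[@]}

wait                                           # block until all background jobs exit

finished=("$workdir"/done_*)
after_wait=${#finished[@]}

echo "finished before wait: $before_wait"
echo "finished after wait:  $after_wait"
rm -rf "$workdir"
```

If the script ended where before_wait is taken, the background work would be cut off in a real batch job; only after wait are all five markers present.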

By the way, watch out for a bug in the first scheme: if the final srun outside the loop is written as script $j rather than script 5, it is wrong. In Bash the loop variable is not reset when the loop ends; it keeps its last value, so the last run inside the loop and the run outside the loop would both invoke script 4.
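How the loop variable behaves after the loop can be verified in plain Bash. This is a small sketch with no Slurm involved: the array seen is a hypothetical stand-in for the srun calls made inside the loop.

```shell
#!/usr/bin/env bash
# Plain Bash, no Slurm needed: shows what $j holds once the loop is over.
seen=()
for j in {1..4}; do
    seen+=("$j")        # stands in for: srun --exclusive -n1 script $j &
done

after_loop="$j"         # j is still set here and holds the last loop value

echo "inside loop: ${seen[*]}"
echo "after loop:  $after_loop"
```

Running everything inside a single loop over {1..5}, as in the second scheme, avoids having to reason about the value of j outside the loop at all.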
