Can you help me run tasks in parallel in Slurm?

I am new to Slurm and I am trying to launch several executables to run in parallel (in the example below it is just the date command). I would like them to start at different times, separated by a short delay.

I have made a few attempts, adding extra lines between the sruns, such as "srun sleep 5s &", or using the "--begin" option shown below. In particular, the "--begin" option fails with the message "--begin is ignored because nodes are already allocated".

The parallel module does not seem to be available on our cluster.

#!/bin/bash
#SBATCH --output=parallel_test_%j.out   # Standard output and error log
#SBATCH --time=06:00:00
#SBATCH --nodes=1   # number of nodes
#SBATCH --ntasks=6   
#SBATCH --mem-per-cpu=1024M   # memory per CPU core

srun="srun -n1 -N1 --exclusive"
# --exclusive     ensures srun uses distinct CPUs for each job step
# -N1 -n1         allocates a single core to each task


$srun date &
$srun --begin=now+3 date &
$srun --begin=now+6 date &
$srun --begin=now+9 date &
$srun --begin=now+12 date &
$srun --begin=now+15 date &
wait

The output I get is the following:

srun: error: --begin is ignored because nodes are already allocated.
srun: error: --begin is ignored because nodes are already allocated.
srun: error: --begin is ignored because nodes are already allocated.
srun: error: --begin is ignored because nodes are already allocated.
srun: error: --begin is ignored because nodes are already allocated.
Sun Jun 23 14:07:05 PDT 2019
Sun Jun 23 14:07:05 PDT 2019
Sun Jun 23 14:07:05 PDT 2019
Sun Jun 23 14:07:05 PDT 2019
Sun Jun 23 14:07:05 PDT 2019
Sun Jun 23 14:07:06 PDT 2019

What I would like to obtain is the following output:

Sun Jun 23 13:22:54 PDT 2019
Sun Jun 23 13:22:57 PDT 2019
Sun Jun 23 13:23:00 PDT 2019
Sun Jun 23 13:23:03 PDT 2019
Sun Jun 23 13:23:06 PDT 2019
Sun Jun 23 13:23:09 PDT 2019

Thank you for your help.

In this case, --begin will be of no help: it is used to defer the start of the job itself, and the job has already started by the time srun runs inside the submission script.

You can get the requested behaviour like this:

$srun date &
sleep 3; $srun date &
sleep 3; $srun date &
sleep 3; $srun date &
sleep 3; $srun date &
sleep 3; $srun date &
wait
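
The repeated "sleep 3; $srun date &" lines can also be written as a loop. The following is only a sketch: the function name launch_staggered and its argument order are made up for illustration and are not part of Slurm. It pauses in the batch script between launches, exactly as the lines above do.

```shell
#!/bin/bash
# Hypothetical helper (not a Slurm command):
#   launch_staggered DELAY COUNT CMD [ARGS...]
# Starts CMD in the background COUNT times, sleeping DELAY seconds in the
# calling script between consecutive launches, then waits for all of them.
launch_staggered() {
    local delay=$1 count=$2
    shift 2
    local i
    for ((i = 0; i < count; i++)); do
        # No pause before the first launch.
        if ((i > 0)); then
            sleep "$delay"
        fi
        "$@" &
    done
    # Block until every background command has finished.
    wait
}
```

Inside the submission script from the question it would be called as: launch_staggered 3 6 srun -n1 -N1 --exclusive date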

or even like this:

$srun date &
$srun bash -c "sleep 3 ; date" &
$srun bash -c "sleep 6 ; date" &
$srun bash -c "sleep 9 ; date" &
$srun bash -c "sleep 12 ; date" &
$srun bash -c "sleep 15 ; date" &
wait
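
In this second variant all job steps are launched at once and each one sleeps inside its own step before calling date. A loop form might look like the sketch below; launch_with_offsets is a hypothetical name, and the launcher (for example the srun command line from the question) is passed in as arguments.

```shell
#!/bin/bash
# Hypothetical helper (not a Slurm command):
#   launch_with_offsets STEP COUNT [LAUNCHER...]
# Starts COUNT commands immediately; command number i (counting from 0)
# sleeps i*STEP seconds and then runs date. LAUNCHER is prefixed to each
# command, e.g. "srun -n1 -N1 --exclusive".
launch_with_offsets() {
    local step=$1 count=$2
    shift 2
    local i
    for ((i = 0; i < count; i++)); do
        # The delay is computed here and baked into the command string.
        "$@" bash -c "sleep $((i * step)); date" &
    done
    # Block until every background command has finished.
    wait
}
```

With the settings from the question this would be: launch_with_offsets 3 6 srun -n1 -N1 --exclusive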

Regarding

The parallel module seems not to be available in our cluster

that does not mean you cannot install it yourself (see this question). If EasyBuild is installed on your cluster, it is even easier (and if it is not, you can install that yourself as well). Then you can use the --delay option:

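# -n0 runs the command once per input item without appending the item to it
parallel --delay 3 -j 6 -n0 $srun date ::: 1 2 3 4 5 6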
