

How to run different independent parallel jobs on different nodes using Slurm with the worker/master concept?

I have a program that uses the master/slave concept for parallelization. There is a master directory and multiple worker directories. I first run the master executable in the master directory, then go to the worker directories and run the worker executable in each one. The master waits for the workers to finish their jobs and send their results back for further calculations. The jobs in the worker directories are independent of each other, so they can run on different machines (nodes). The master and the workers communicate with each other using the TCP/IP protocol.

I'm working on a cluster with 16 nodes, each with 28 cores, managed by the Slurm job manager. I can run my job with 20 workers on 1 node totally fine. Currently my Slurm script looks like this:

#!/bin/bash
#SBATCH -n 1               # total number of tasks requested
#SBATCH --cpus-per-task=18 # cpus to allocate per task
#SBATCH -p shortq            # queue (partition) -- defq, eduq, gpuq.
#SBATCH -t 12:00:00        # run time (hh:mm:ss) - 12 hours in this example

cd /To-master-directory
master.exe /h :4004 &
MASTER_PID=$!

cd /To-Parent 
# This is the directory that contains all worker (wrk)directories

parallel -i bash -c "cd {} ; worker.exe /h 127.0.0.1:4004" -- \
    wrk1 wrk2 wrk3 wrk4 wrk5 wrk6 wrk7 wrk8 wrk9 wrk10 wrk11 wrk12 wrk13 wrk14 \
    wrk15 wrk16 wrk17 wrk18 wrk19 wrk20
kill ${MASTER_PID}

I was wondering how I can modify this script to divide the jobs running on the workers between multiple nodes, so that, for example, the jobs associated with wrk1 to wrk5 run on node 1, the jobs associated with wrk6 to wrk10 run on node 2, and so on.

First, you need to let Slurm allocate distinct nodes for your job, so you need to remove the --cpus-per-task option and instead ask for 18 tasks.
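In #SBATCH terms, the change (mirrored in the full script below) looks like this:

#SBATCH -n 18              # 18 tasks of 1 CPU each, instead of: -n 1 --cpus-per-task=18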

Second, you need to get the hostname where the master runs, as 127.0.0.1 will no longer be valid in a multi-node setup.

Third, just add srun before the call to bash in parallel. With --exclusive -n 1 -c 1, srun will dispatch each instance of the worker spawned by parallel to one of the CPUs in the allocation. They might be on the same node or on other nodes.
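As a minimal sketch of that dispatch behaviour (a standalone illustration, with hostname standing in for the real worker command), each srun call below becomes one job step on one free CPU of the allocation, and queues if no CPU is currently free:

srun --exclusive -n 1 -c 1 hostname &   # one job step, pinned to one CPU
srun --exclusive -n 1 -c 1 hostname &   # may land on the same node or another one
wait                                    # wait for all job steps to finish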

So the following could work (untested):

#!/bin/bash
#SBATCH -n 18               # total number of tasks requested
#SBATCH -p shortq            # queue (partition) -- defq, eduq, gpuq.
#SBATCH -t 12:00:00        # run time (hh:mm:ss) - 12 hours in this example

cd /To-master-directory
master.exe /h :4004 &
MASTER_PID=$!
MASTER_HOSTNAME=$(hostname)

cd /To-Parent 
# This is the directory that contains all worker (wrk)directories

parallel -i srun --exclusive -n 1 -c 1 bash -c "cd {} ; worker.exe /h $MASTER_HOSTNAME:4004" -- \
    wrk1 wrk2 wrk3 wrk4 wrk5 wrk6 wrk7 wrk8 wrk9 wrk10 wrk11 wrk12 wrk13 wrk14 \
    wrk15 wrk16 wrk17 wrk18 wrk19 wrk20
kill ${MASTER_PID}

Note that in your example, with 18 tasks and 20 directories to process, the job will first run 18 workers, and then the two additional ones will be 'micro-scheduled': each starts as soon as a previous task finishes.
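If you would rather have all 20 workers start at once, one option (untested, following the same logic) is to request one task per worker directory:

#SBATCH -n 20              # one task per worker directory, so no worker has to wait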
