Running a queue of MPI calls in parallel with SLURM and limited resources

I'm trying to run a Particle Swarm Optimization problem on a cluster using SLURM, with the optimization algorithm managed by a single-core matlab process. Each particle evaluation requires multiple MPI calls that alternate between two Python programs until the result converges. Each MPI call takes up to 20 minutes.

I initially naively submitted each MPI call as a separate SLURM job, but the resulting queue time made it slower than running each job locally in serial. I am now trying to figure out a way to submit an N node job that will continuously run MPI tasks to utilize the available resources. The matlab process would manage this job with text file flags.

Here is a pseudo-code bash file that might help to illustrate what I am trying to do on a smaller scale:

#!/bin/bash

#SBATCH -t 4:00:00 # walltime
#SBATCH -N 2 # number of nodes in this job
#SBATCH -n 32 # total number of processor cores in this job

# Set required modules
module purge
module load intel/16.0
module load gcc/6.3.0

# Job working directory
echo Working directory is $SLURM_SUBMIT_DIR
cd $SLURM_SUBMIT_DIR
echo Running on host `hostname`
echo Time is `date`
echo Directory is `pwd`

# Run Command
while [ "$(cat KeepRunning.txt)" == "1" ]
do
  for i in {0..40}
  do
    if [ "$(cat RunJob_${i}.txt)" == "1" ]
    then
      mpirun -np 8 -rr -f ${PBS_NODEFILE} <job_i> &
    fi
  done
done

wait
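
For reference, the flag handshake I have in mind looks like this from the controller's side (sketched in bash here, though the real controller is the matlab process; the file names match the pseudo-code above):

# Controller side (sketch): the matlab process drives the worker loop
echo 1 > KeepRunning.txt    # tell the SLURM job to keep polling for work
echo 1 > RunJob_3.txt       # request an MPI evaluation for particle 3
# ... the worker loop sees the flag and launches the MPI call; the matlab
# process reads the result and clears or re-raises the flag ...
echo 0 > KeepRunning.txt    # tell the SLURM job to exit its loop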

This approach doesn't work (it just crashes), but I don't know why (probably over-utilization of resources?). Some of my peers have suggested using parallel with srun, but as far as I can tell this requires that I call the MPI functions in batches. That would be a huge waste of resources, as a significant portion of the runs finish or fail quickly (this is expected behavior). A concrete example of the problem: start a batch of five 8-core jobs and have 4 of them crash immediately; now 32 cores would be doing nothing while they wait up to 20 minutes for the 5th job to finish.
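
To make that concrete, the batched pattern would look roughly like the sketch below, where ./job_$i is a placeholder for one 8-core MPI call:

# Batched launching (sketch): start a batch of 5 jobs, then wait for ALL of them
for i in 0 1 2 3 4
do
  srun -n 8 --exclusive ./job_$i &   # ./job_$i is a placeholder
done
wait   # if 4 jobs fail immediately, 32 cores sit idle for up to
       # 20 minutes while the one surviving job finishes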

Since the optimization will likely require upwards of 5000 MPI calls, any increase in efficiency will make a huge difference in absolute walltime. Does anyone have any advice as to how I could run a constant stream of MPI calls within a large SLURM job? I would really appreciate any help.

A couple of things: first, under SLURM you should be using srun, not mpirun. Second, the pseudo-code you provided launches an infinite number of jobs without waiting for any completion signal. You should put the wait inside the while loop, so you launch just one set of jobs, wait for them to finish, evaluate the condition and, maybe, launch the next set of jobs:

#!/bin/bash
#SBATCH -t 4:00:00 # walltime
#SBATCH -N 2 # number of nodes in this job
#SBATCH -n 32 # total number of tasks in this job (room for four 8-task steps)
#SBATCH -c 1 # number of processor cores (cpus) for each task

# Set required modules
module purge
module load intel/16.0
module load gcc/6.3.0

# Job working directory
echo Working directory is $SLURM_SUBMIT_DIR
cd $SLURM_SUBMIT_DIR
echo Running on host `hostname`
echo Time is `date`
echo Directory is `pwd`

# Run Command
while [ "$(cat KeepRunning.txt)" == "1" ]
do
  for i in {0..40}
  do
    if [ "$(cat RunJob_${i}.txt)" == "1" ]
    then
      srun -n 8 --exclusive <job_i> &
    fi
  done
  wait
  # <update KeepRunning.txt here, based on the new results>
done

Take care also to distinguish between tasks and cores. -n says how many tasks will be used, -c says how many cpus per task will be allocated.
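
As a sketch of the difference (the numbers here are illustrative, not taken from the job above):

#SBATCH -n 8 # 8 tasks: an MPI program launched with srun gets 8 ranks
#SBATCH -c 4 # 4 cpus per task: each rank may run, e.g., 4 threads
# Total allocation: 8 tasks x 4 cpus per task = 32 cores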

The code I wrote will launch 41 jobs in the background (from 0 to 40, inclusive), but they will only start once the resources are available (--exclusive), waiting while the resources are occupied. Each job will use 8 CPUs. Then you wait for them to finish, and I assume you will update KeepRunning.txt after that round.
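
You can see the queueing behaviour of --exclusive with a toy example. This sketch assumes a 32-task allocation like the one above, so only four 8-task steps fit at once; the remaining sruns wait (printing a message like "srun: Job step creation temporarily disabled, retrying") until CPUs free up:

# Inside a 32-task allocation: launch ten 8-task steps in the background
for i in $(seq 0 9)
do
  srun -n 8 --exclusive sleep 60 &   # at most 4 steps run at any moment
done
wait   # returns once all ten steps have finished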
