Running a multi-stage job using SLURM

I am new to SLURM. I have a multi-stage job that needs to run on a cluster whose jobs are managed by SLURM. Specifically, I want to schedule a job that:

  1. Grabs N nodes,
  2. Installs software on all of them,
  3. (once all nodes have finished the installation successfully) creates a database instance on the nodes,
  4. Loads the database,
  5. (once loading has finished successfully) runs a set of queries for benchmarking purposes,
  6. Drops the database and returns the nodes.

Each step could be run using a separate bash script, with the execution of the scripts and the transitions between stages coordinated by a master node.

I know how to use SLURM to allocate nodes and call a single command or script on each of them (which runs as a stand-alone job on each node). But as soon as the command is done (or the called script has finished) on a node, that node returns to the pool of free resources and leaves the set of nodes allocated to my job. The use case above, however, involves several stages/scripts and needs coordination between them.

I am wondering what the correct way is to design/run a set of scripts for such a use case, using SLURM. Any suggestion or example would be extremely helpful, and highly appreciated.

You simply need to encapsulate all your scripts into a single one for submission:

#!/bin/bash
#SBATCH --nodes=4 --exclusive

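# Make Bash exit as soon as a command (or any command in a pipeline)
# exits with a non-zero status.
set -e
set -o pipefail

# srun launches install.sh across the allocation; the remaining scripts
# run once, on the node that executes this batch script.
echo "Installing software on each of $SLURM_NODELIST"
srun ./install.sh

echo "Creating database instance"
./createDBInstance.sh "$SLURM_NODELIST"

echo "Loading DB"
./loadDB.sh params

echo "Benchmarking"
./benchmarks.sh params

echo "Done."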
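
As an aside on the two `set` lines in the script: a minimal sketch of what `pipefail` changes, runnable on any machine without SLURM. The variable names `plain` and `strict` are only illustrative.

```shell
#!/bin/bash
# By default a pipeline's exit status is that of its LAST command,
# so a failure earlier in the pipeline goes unnoticed.
( set +o pipefail; false | true ); plain=$?

# With pipefail, the pipeline fails if ANY command in it fails.
( set -o pipefail; false | true ); strict=$?

echo "without pipefail: $plain, with pipefail: $strict"
```

Combined with `set -e`, this means a stage such as `./loadDB.sh params | tee load.log` (a hypothetical invocation) would still abort the job if the load itself failed.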

You'll need to fill in the blanks in the submission script. Make sure that your scripts follow the convention of exiting with a non-zero status on error.
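
One thing to keep in mind with `set -e`: if a stage fails, the script stops there, so a final "drop the database" step would be skipped. The following is a rough sketch of one way to run stages in order, stop at the first failure, and still run a cleanup stage. The function `run_pipeline` and all stage names are hypothetical stand-ins (they are not part of SLURM or of the script above); in a real job each stand-in would call the corresponding script.

```shell
#!/bin/bash
# Sketch only: every name below is a hypothetical stand-in.

# Run the given stages in order. The first argument is a cleanup stage that
# always runs at the end. Returns the status of the first failed stage,
# or 0 if every stage succeeded.
run_pipeline() {
    local cleanup=$1; shift
    local stage rc=0
    for stage in "$@"; do
        echo "stage: $stage"
        "$stage" || {
            rc=$?
            echo "stage '$stage' failed with status $rc" >&2
            break
        }
    done
    "$cleanup"
    return "$rc"
}

# Stand-ins for the real stage scripts (install, load, benchmark, drop).
install_sw() { true; }
load_db()    { return 3; }            # simulate a failed load
benchmark()  { echo "benchmarking"; }
drop_db()    { echo "dropping database"; }

output=$(run_pipeline drop_db install_sw load_db benchmark 2>/dev/null)
status=$?
echo "$output"
echo "pipeline exit status: $status"
```

In this run the simulated load failure stops the pipeline before the benchmark stage, the cleanup stage still runs, and the failing stage's status is passed on, so the job as a whole still exits with a non-zero status.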
