
How to know the status of each process of one job in the slurm cluster manager?

After using the Slurm cluster manager to sbatch a job with multiple processes, is there a way to know the status (running or finished) of each process? Can this be implemented in a Python script?

If the processes you mention are distinct steps, then sacct can give you that information, as explained by @Christopher Bottoms.

But if the processes are different tasks within a single step, then you can use parallel SSH to run ps on the compute nodes and summarise the results, as @Tom de Geus suggests; a sketch of that approach follows.
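
Below is a minimal Python sketch of that idea, not a definitive implementation. It assumes passwordless SSH to the compute nodes and a known job ID (8021 here is hypothetical); squeue, scontrol, ssh, and ps are the only tools it calls. It contacts the nodes one at a time; a thread pool or a tool such as pdsh would make it truly parallel.

#!/usr/bin/env python3
"""Summarise the processes a running job has on each of its nodes."""
import subprocess

JOB_ID = "8021"  # hypothetical job ID; replace with your own

# Ask Slurm which nodes the job occupies (a compact hostlist, e.g. node[01-04]).
hostlist = subprocess.run(
    ["squeue", "-j", JOB_ID, "--noheader", "-o", "%N"],
    capture_output=True, text=True, check=True,
).stdout.strip()

# Expand the compact hostlist into individual hostnames.
hostnames = subprocess.run(
    ["scontrol", "show", "hostnames", hostlist],
    capture_output=True, text=True, check=True,
).stdout.split()

# Run ps on each node over SSH and print a per-node process summary.
for host in hostnames:
    result = subprocess.run(
        ["ssh", host, "ps", "-u", "$USER", "-o", "pid,stat,etime,comm"],
        capture_output=True, text=True,
    )
    print(f"--- {host} ---")
    print(result.stdout)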

Just use the command sacct that comes with Slurm.

Given this script (my.sbatch):

#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks=2

# Launch two single-task steps: one in the background, one in the foreground.
srun -n1 sleep 10 &
srun -n1 sleep 3

# Wait for the background step to finish before the script exits.
wait

I run it:

sbatch my.sbatch

And then check on it with sacct:

sacct

Which gives me per-step info:

     JobID    JobName  Partition    Account  AllocCPUS      State ExitCode
---------- ---------- ---------- ---------- ---------- ---------- --------
8021        my.sbatch    CLUSTER        me          2     RUNNING      0:0
8021.0          sleep                   me          1     RUNNING      0:0
8021.1          sleep                   me          1   COMPLETED      0:0

sacct has a lot of options to customize its output. For example,

sacct --format='JobID%6,State'

will give you just the job IDs (truncated to 6 characters) and the current state of each job and step:

 JobID      State
------ ----------
  8021    RUNNING
8021.0    RUNNING
8021.1  COMPLETED
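
As for doing this from Python: you can call sacct with subprocess and parse its output. A minimal sketch, assuming sacct is on PATH and reusing the job ID from the example above:

#!/usr/bin/env python3
"""Map each step of a Slurm job to its current state via sacct."""
import subprocess

def step_states(job_id):
    """Return a dict mapping step IDs (e.g. '8021.0') to their State."""
    out = subprocess.run(
        ["sacct", "-j", str(job_id), "--format=JobID,State",
         "--parsable2", "--noheader"],
        capture_output=True, text=True, check=True,
    ).stdout
    return dict(line.split("|") for line in out.splitlines())

if __name__ == "__main__":
    for step, state in step_states(8021).items():
        print(f"{step}: {state}")

Here --parsable2 makes sacct print pipe-delimited fields without padding, which is much easier to parse than the aligned columns shown above.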
