SLURM with Python multiprocessing gives inconsistent results

We have a small HPC cluster with 4 nodes of 64 cores each, and SLURM is installed on it.

The nodes are:

sinfo -N -l
Mon Oct  3 08:58:12 2016
NODELIST    NODES PARTITION       STATE CPUS    S:C:T MEMORY TMP_DISK WEIGHT FEATURES REASON              
dlab-node1      1     dlab*        idle   64   2:16:2 257847        0      1   (null) none                
dlab-node2      1     dlab*        idle   64   2:16:2 257847        0      1   (null) none                
dlab-node3      1     dlab*        idle   64   2:16:2 257847        0      1   (null) none                
dlab-node4      1     dlab*        idle   64   2:16:2 257847        0      1   (null) none  

To test SLURM, I wrote a small Python script that uses multiprocessing:

import multiprocessing
import os
def func(i):
    print(n_procs)

n_procs = int(os.environ['SLURM_JOB_CPUS_PER_NODE'].split('(')[0]) * int(os.environ['SLURM_JOB_NUM_NODES'])
p = multiprocessing.Pool(n_procs)
list(p.imap_unordered(func, [i for i in range(n_procs*2)]))

I use the following batch script to run it with SLURM:

#!/bin/bash
#
#SBATCH -p dlab                # partition (queue)
#SBATCH -N 2                      # number of nodes
#SBATCH -n 64                     # number of tasks (one CPU each by default)
#SBATCH --mem 250                 # memory per node, in MB
#SBATCH -t 0-2:00                 # time (D-HH:MM)
#SBATCH -o slurm.%N.%j.out        # STDOUT
#SBATCH -e slurm.%N.%j.err        # STDERR

python3 asd.py

I would expect this to print the value 128 a total of 256 times to the STDOUT file.

However, when I run it multiple times, I get a very different number of lines each time (every line contains 128, which is correct).

On the first run I got 144 lines, on the second I got 256 (which is correct), and on the third I got 184.

What is the problem? Should I investigate something in the SLURM configuration, or is something wrong with Python multiprocessing?

From the sbatch man page:

SLURM_JOB_CPUS_PER_NODE

Count of processors available to the job on this node. Note the select/linear plugin allocates entire nodes to jobs, so the value indicates the total count of CPUs on the node. The select/cons_res plugin allocates individual processors to jobs, so this number indicates the number of processors on this node allocated to the job.

As the man page states, the variable only gives the number of CPUs allocated on the node where the script is running, so multiplying it by the number of nodes is only valid if every node received the same share. If you want a homogeneous allocation, you should specify --ntasks-per-node=32.

Also, bear in mind that multiprocessing will not spawn processes on more than one node; all of its workers run on the node where the script was started. If you want to span multiple nodes, there is good documentation here.
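Since the pool only ever runs on one node, it may make more sense to size it from the local CPU count than from the whole allocation. Here is a minimal sketch of that idea. The helper names parse_cpus_per_node and local_worker_count are my own and not part of SLURM or Python. It assumes the value has the form documented in recent sbatch man pages, such as 32(x2) or 8,56, and that the first entry belongs to the node running the batch script (sbatch runs the script on the first allocated node). Check both assumptions against your SLURM version.

```python
import os
import re


def parse_cpus_per_node(value):
    """Expand a string such as '32(x2)' or '8,56' into one count per node."""
    counts = []
    for part in value.split(','):
        # Each entry is CPU_count, optionally followed by (xN) for N nodes.
        match = re.fullmatch(r'(\d+)(?:\(x(\d+)\))?', part.strip())
        if match is None:
            raise ValueError('unexpected entry: %r' % part)
        cpus = int(match.group(1))
        repeat = int(match.group(2) or 1)
        counts.extend([cpus] * repeat)
    return counts


def local_worker_count(environ=os.environ):
    """Pick a pool size for the node the script is running on."""
    value = environ.get('SLURM_JOB_CPUS_PER_NODE')
    if value is None:
        # Not running under SLURM: fall back to what the OS reports.
        return os.cpu_count() or 1
    # Assumption: the first entry describes the batch script's node.
    return parse_cpus_per_node(value)[0]


if __name__ == '__main__':
    print(local_worker_count())
```

Passing local_worker_count() to multiprocessing.Pool instead of the product of CPUs and nodes keeps the number of workers in line with the CPUs that this node actually got.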
