We have a small HPC cluster with 4 nodes of 64 cores each, and SLURM is installed on it.
The nodes are:
sinfo -N -l
Mon Oct 3 08:58:12 2016
NODELIST NODES PARTITION STATE CPUS S:C:T MEMORY TMP_DISK WEIGHT FEATURES REASON
dlab-node1 1 dlab* idle 64 2:16:2 257847 0 1 (null) none
dlab-node2 1 dlab* idle 64 2:16:2 257847 0 1 (null) none
dlab-node3 1 dlab* idle 64 2:16:2 257847 0 1 (null) none
dlab-node4 1 dlab* idle 64 2:16:2 257847 0 1 (null) none
To test SLURM, I wrote a small Python script using multiprocessing:
import multiprocessing
import os

def func(i):
    print(n_procs)

# CPUs per node (e.g. "64(x2)" -> 64) multiplied by the number of allocated nodes
n_procs = int(os.environ['SLURM_JOB_CPUS_PER_NODE'].split('(')[0]) * int(os.environ['SLURM_JOB_NUM_NODES'])
p = multiprocessing.Pool(n_procs)
list(p.imap_unordered(func, [i for i in range(n_procs*2)]))
I use the following batch script to run it with SLURM:
#!/bin/bash
#
#SBATCH -p dlab # partition (queue)
#SBATCH -N 2 # number of nodes
#SBATCH -n 64 # number of cores
#SBATCH --mem 250 # memory pool for all cores
#SBATCH -t 0-2:00 # time (D-HH:MM)
#SBATCH -o slurm.%N.%j.out # STDOUT
#SBATCH -e slurm.%N.%j.err # STDERR
python3 asd.py
I would expect this to print 128 to the STDOUT file 256 times.
However, when I run it multiple times, I get very different numbers of lines (they all contain 128, which is correct). On the first run I got 144 lines, on the second 256 (which is correct), and on the third 184.
What is the problem? Should I investigate something in the SLURM configuration, or is something wrong with Python multiprocessing?
From sbatch man page:
SLURM_JOB_CPUS_PER_NODE
Count of processors available to the job on this node. Note the select/linear plugin allocates entire nodes to jobs, so the value indicates the total count of CPUs on the node. The select/cons_res plugin allocates individual processors to jobs, so this number indicates the number of processors on this node allocated to the job.
As highlighted ("on this node"), the variable only returns the number of CPUs allocated on the node where the script is running. If you want a homogeneous allocation, you should specify --ntasks-per-node=32.
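To see what the variable actually contains, it can help to expand it into one count per node. The following is only a sketch: it assumes the compressed form that Slurm uses for this kind of variable (a count optionally followed by a repeat such as "(x2)", with entries separated by commas). The helper name parse_cpus_per_node and the example string are made up for illustration.

```python
import re

def parse_cpus_per_node(value):
    """Expand a string such as "64(x2),32" into [64, 64, 32].

    Assumed format: comma-separated entries, each "<cpus>" or "<cpus>(x<repeat>)".
    """
    counts = []
    for part in value.split(","):
        match = re.fullmatch(r"(\d+)(?:\(x(\d+)\))?", part.strip())
        if match is None:
            raise ValueError(f"unexpected entry: {part!r}")
        cpus = int(match.group(1))
        repeat = int(match.group(2) or 1)  # no "(xN)" suffix means one node
        counts.extend([cpus] * repeat)
    return counts

# Hypothetical value; in a real job it would come from
# os.environ["SLURM_JOB_CPUS_PER_NODE"].
example = "64(x2),32"
per_node = parse_cpus_per_node(example)
print(per_node)       # [64, 64, 32]
print(sum(per_node))  # 160
```

Unlike split('(')[0], this does not silently assume that every node received the same number of CPUs, so an uneven allocation shows up immediately.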
Also, bear in mind that multiprocessing will not spawn processes on more than one node. If you want to span multiple nodes, there is good documentation here.
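Since a multiprocessing pool stays on a single node, one option is to size it from the CPUs available locally instead of multiplying by the node count. This is a minimal sketch, not a definitive recipe: it assumes Slurm exports SLURM_CPUS_ON_NODE for the job, falls back to os.sched_getaffinity (Linux only) and then os.cpu_count, and uses hypothetical helper names local_cpu_count and square.

```python
import multiprocessing
import os

def local_cpu_count():
    """Best-effort count of CPUs usable by this process on the current node."""
    value = os.environ.get("SLURM_CPUS_ON_NODE")  # assumed to be set inside a job
    if value and value.isdigit():
        return max(1, int(value))
    if hasattr(os, "sched_getaffinity"):  # available on Linux
        return len(os.sched_getaffinity(0))
    return os.cpu_count() or 1

def square(i):
    return i * i

if __name__ == "__main__":
    n_procs = local_cpu_count()
    with multiprocessing.Pool(n_procs) as pool:
        # Collect results in the parent instead of printing from each worker.
        results = sorted(pool.imap_unordered(square, range(2 * n_procs)))
    print(n_procs, len(results))
```

Returning values to the parent process and printing once there also keeps the output file independent of how the workers' writes interleave.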