
SLURM and python, nodes are allocated, but the code only runs on one node

I have a 4*64 CPU cluster. I installed SLURM, and it seems to be working: if I call sbatch I get the proper allocation and queue. However, if I use more than 64 cores (so basically more than one node), it allocates the correct number of nodes, but if I ssh into the allocated nodes I only see actual work on one of them. The rest just sit there doing nothing.

My code is complex, and it uses multiprocessing . I call pools with around 300 workers, so I guess that should not be the problem.

What I would like to achieve is to call sbatch myscript.py on something like 200 cores, and have SLURM distribute my run across those 200 cores, rather than just allocating the correct number of nodes while actually using only one.

The header of my python script looks like this:

#!/usr/bin/python3

#SBATCH --output=SLURM_%j.log
#SBATCH --partition=part
#SBATCH -n 200

and I call the script with sbatch myscript.py .

Unfortunately, multiprocessing does not allow working on several nodes. From the documentation:

the multiprocessing module allows the programmer to fully leverage multiple processors on a given machine

One option, often used with Slurm, is to use MPI (with the MPI4PY package), but MPI is considered to be 'the assembly language of parallel programming' and you will need to modify your code extensively.
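
For illustration, a minimal mpi4py sketch of that approach (the workload and the rank-based split here are placeholders, and it assumes the script is launched as one MPI rank per Slurm task, e.g. with srun):

# minimal mpi4py sketch: each Slurm task becomes one MPI rank
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()      # id of this task, 0 .. size-1
size = comm.Get_size()      # total number of tasks (e.g. 200)

work_items = list(range(1000))        # placeholder workload
my_items = work_items[rank::size]     # simple round-robin split across ranks

results = [item * item for item in my_items]   # do the actual work here

# gather the partial results on rank 0
all_results = comm.gather(results, root=0)
if rank == 0:
    print("collected", sum(len(r) for r in all_results), "results")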

Another option is to look into the Parallel Processing packages for one that suits your needs and requires minimal changes to your code. See also this other question for more insights.

A final note: it is perfectly fine to put the #SBATCH directives in the Python script and use the Python shebang. But as Slurm executes a copy of the script rather than the script itself, you must add a line such as

sys.path.append(os.getcwd()) 

at the beginning of the script (but after the #SBATCH lines) to make sure Python finds any module located in your directory.
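
For example, with the directives from the question, the top of the submitted script might look like this (a sketch; the exact directives are whatever your job needs):

#!/usr/bin/python3

#SBATCH --output=SLURM_%j.log
#SBATCH --partition=part
#SBATCH -n 200

import os
import sys

# Slurm runs a copy of this file, not the file itself, so put the
# submission directory back on the module search path
sys.path.append(os.getcwd())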

I think your sbatch script should not be inside the python script. Rather, it should be a normal bash script that includes the #SBATCH options, followed by the actual command to run with srun, like the following:

#!/usr/bin/bash

#SBATCH --output=SLURM_%j.log
#SBATCH --partition=part
#SBATCH -n 200

srun python3 myscript.py

I suggest testing this with a simple python script like this:

import multiprocessing as mp

def main():
    print("cpus =", mp.cpu_count())

if __name__ == "__main__":
    main()

I tried to get around using different python libraries by using srun on the following bash script. srun should run on each node that has been allocated to you. The basic idea is that it determines which node it is running on and assigns that node an id of 0, 1, ..., nnodes-1. Then it passes that information to the python program along with a thread id. In the program I combine these two numbers to make a distinct id for each cpu on each node (a sketch of that Python side follows the script below). This code assumes that there are 16 cores on each node and that 10 nodes are going to be used.

#!/bin/bash

# list of node names allocated to this job, in order
nnames=($(scontrol show hostnames))
nnodes=${#nnames[@]}

# find the index (node id) of the node this copy of the script runs on
nID=0
hname=$(hostname)
for i in $(seq 0 $((nnodes - 1)))
do
    if [ "${nnames[$i]}" == "$hname" ]
        then nID=$i
    fi
done

# launch one python process per core (16 cores per node assumed)
for tID in $(seq 0 15)
do
    python testDataFitting2.py $nID $tID 160 &
done
wait
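
For completeness, a hypothetical sketch of the Python side: testDataFitting2.py is not shown in the answer, so the argument handling and the workload here are assumptions, but it illustrates how the two ids can be combined into a distinct worker id:

import sys

# hypothetical sketch: combine node id and thread id into a unique worker id
node_id = int(sys.argv[1])        # 0 .. nnodes-1, passed by the bash script
thread_id = int(sys.argv[2])      # 0 .. 15, one per core on the node
total_workers = int(sys.argv[3])  # e.g. 160 = 10 nodes * 16 cores

cores_per_node = 16
worker_id = node_id * cores_per_node + thread_id   # distinct id per cpu

# each worker processes its own slice of the overall work
work_items = list(range(1600))                     # placeholder workload
my_items = work_items[worker_id::total_workers]
print(f"worker {worker_id}/{total_workers} handles {len(my_items)} items")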
