Slurm Multiprocessing Python Job

I have a 4-node Slurm cluster, each node with 6 cores. I would like to submit a test Python script (it spawns processes that print the hostname of the node they are running on) using multiprocessing, as follows:

from multiprocessing import Pool
from os import environ
from socket import gethostname

def print_something():
    print(gethostname())

if __name__ == '__main__':
    # number of processes allowed to run on the cluster at a given time
    # (note: SLURM_JOB_CPUS_PER_NODE is a compressed list such as "6(x4)"
    # when the allocation spans several nodes, so it may need parsing first)
    n_procs = int(environ['SLURM_JOB_CPUS_PER_NODE']) * int(environ['SLURM_JOB_NUM_NODES'])

    # tell Python how many worker processes may run at a time
    pool = Pool(n_procs)

    # spawn an arbitrary number of tasks
    for i in range(200):
        pool.apply_async(print_something)
    pool.close()
    pool.join()

I submit this with an SBATCH script, which specifies nodes=4 and ntasks-per-node=6, but I am finding that the Python script gets executed 4 * 6 = 24 times. I just want the job to execute the script once and allow Slurm to distribute the process spawns across the cluster.
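For reference, a minimal sketch of the kind of submission script that reproduces this behaviour (the script file name is an assumption, and launching via srun is assumed, since that is what starts one copy per task):

#!/bin/bash
#SBATCH --nodes=4
#SBATCH --ntasks-per-node=6

# srun starts one instance of the command per allocated task,
# so the Python script below is executed 4 * 6 = 24 times
srun python test_multiprocessing.py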

I'm obviously not understanding something here...?

OK, I figured it out.

I needed to have a better understanding of the relationship between SBATCH and SRUN. Mainly, SBATCH acts as a global job container for SRUN invocations.

The biggest factor here was changing from Python multiprocessing to subprocess. This way, SBATCH executes a single Python script, which in turn dynamically invokes SRUN subprocesses running another Python script, and Slurm allocates cluster resources appropriately.
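A minimal sketch of that pattern, not the original code: the file names (driver.py, print_hostname.py), the task count, and the srun flags are assumptions. The SBATCH script allocates the nodes and runs the driver exactly once (e.g. python driver.py), and the driver then hands the fan-out to Slurm:

# driver.py -- run once by the SBATCH job; fans work out via srun job steps
import subprocess

N_TASKS = 24  # arbitrary number of work items

# Each srun call requests one task on one node within the existing
# allocation; Slurm schedules the job steps across the nodes and
# queues any steps that exceed the currently free CPUs.
procs = [
    subprocess.Popen(['srun', '--nodes=1', '--ntasks=1', '--exclusive',
                      'python', 'print_hostname.py'])
    for _ in range(N_TASKS)
]

# wait for every job step to finish
for p in procs:
    p.wait()

Here --exclusive keeps the job steps from sharing CPUs, so steps beyond the number of allocated tasks simply wait for a free slot instead of oversubscribing the nodes.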
