How to use all allocated nodes with Python on an HPC cluster

I have an HPC cluster with SLURM installed. I can properly allocate nodes and cores for myself. I would like to be able to use all the allocated cores regardless of which node they are on. As I saw in the thread Using the multiprocessing module for cluster computing, this cannot be achieved with multiprocessing.

My script looks like this (oversimplified version):

import multiprocessing

def func(input_data):
    # lots of computing that produces data
    return data

parallel_pool = multiprocessing.Pool(processes=300)
returned_data_list = []
for i in parallel_pool.imap_unordered(func, lots_of_input_data):
    returned_data_list.append(i)
# Do additional computing with returned_data_list
# ...

This script works perfectly fine; however, as I mentioned, multiprocessing is not a good tool for me: even if SLURM allocates 3 nodes to me, multiprocessing can only use one. As far as I understand, this is a limitation of multiprocessing.

I could use SLURM's srun, but that just executes the same script N times, and I need to do additional computing with the output of the parallel processes. I could of course store the outputs somewhere and read them back in (see the sketch below), but there must be a more elegant solution.
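To make that concrete, the store-and-read-back workaround would look roughly like the following sketch (func, lots_of_input_data and the out_*.pkl file names are placeholders): each copy started by srun picks its share of the input using the SLURM_PROCID and SLURM_NTASKS environment variables, pickles its results, and a follow-up script reads everything back in.

# worker.py - one copy of this runs per srun task
import os
import pickle

def func(input_data):
    # lots of computing
    return input_data

lots_of_input_data = range(1000)              # placeholder input
rank = int(os.environ["SLURM_PROCID"])        # index of this copy, set by srun
n_tasks = int(os.environ["SLURM_NTASKS"])     # how many copies srun started
my_results = [func(x) for x in lots_of_input_data[rank::n_tasks]]
with open("out_%d.pkl" % rank, "wb") as f:    # store this copy's output
    pickle.dump(my_results, f)

# gather.py - run once after the srun step, to read everything back in
import glob
import pickle

returned_data_list = []
for path in sorted(glob.glob("out_*.pkl")):
    with open(path, "rb") as f:
        returned_data_list.extend(pickle.load(f))
# Do additional computing with returned_data_list

This works, but it goes through the file system and needs a second script, which is exactly the inelegance I would like to avoid.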

In the mentioned thread there are suggestions like jug, but reading through it I haven't found a solution for myself.

Maybe mpi4py can be a solution for me? The tutorials for it seem very messy, and I haven't found a specific solution for my problem there either (run a function in parallel with MPI, and then continue the script; a sketch of what I mean follows).
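What I am after would look something like this minimal sketch, assuming mpi4py's MPIPoolExecutor (func and lots_of_input_data are placeholders again):

# script.py - run a function in parallel over MPI, then continue the script
from mpi4py.futures import MPIPoolExecutor

def func(input_data):
    # lots of computing
    return input_data

if __name__ == "__main__":
    lots_of_input_data = range(1000)              # placeholder input
    with MPIPoolExecutor() as executor:           # workers on the other MPI ranks
        returned_data_list = list(executor.map(func, lots_of_input_data))
    # Do additional computing with returned_data_list

Launched with something like srun -n 300 python -m mpi4py.futures script.py, the extra ranks act as workers and only the root rank continues past the pool, which is the "run in parallel, then continue" behaviour I need.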

I tried subprocess calls, but they seem to work the same way as multiprocessing calls, so they only run on one node. I haven't found any confirmation of this, so it is only a guess from my trial and error.

How can I overcome this problem? Currently I can allocate more than 300 cores, but one node only has 32, so if I could find a solution I would be able to run my project nearly 10 times as fast.

Thanks

After a lot of trouble, scoop was the library that solved my problem.
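A minimal sketch of the same pool pattern ported to SCOOP (func and lots_of_input_data are placeholders):

# script.py - the multiprocessing pattern, rewritten for SCOOP
from scoop import futures

def func(input_data):
    # lots of computing
    return input_data

if __name__ == "__main__":
    lots_of_input_data = range(1000)          # placeholder input
    returned_data_list = list(futures.map(func, lots_of_input_data))
    # Do additional computing with returned_data_list

Started from inside the SLURM allocation with something like python -m scoop -n 300 script.py (or with a --hostfile listing the allocated nodes), SCOOP spawns workers across all the nodes, and futures.map brings the results back into the main script so the additional computing can follow directly.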
