
Python on an HPC cluster computer

Original post 2014-11-10 23:31:59 5 1 python/ subprocess/ mpi/ cluster-computing/ hpc

I asked a question very close to this one before, but it wasn't answered. I hope I have since learned to ask it better.

I am curious how to run many serial jobs on a Cray XE6 machine. You usually qsub things with ccmrun (for a serial job) or aprun (instead of mpirun or mpiexec). I first wanted to use the Pool() function, but because the hardware is not SMP based, it would be limited to 32 processors. Even an mpi4py version of something like a pool wouldn't work, because I am not giving the main program all of the processors. If I were to say aprun -n 64 mpipool.py, I would be running that script 64 times, whereas something like aprun -n 1 -d 32 pool.py does work.
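For reference, the single-node case (the one launched with aprun -n 1 -d 32 pool.py) could look roughly like the sketch below. It is only an illustration under assumptions: run_command and run_all are made-up names, the job list is a placeholder, and it uses ThreadPool (the thread-backed variant of Pool in multiprocessing.pool) rather than Pool itself, on the reasoning that each worker only waits on an external process.

```python
import subprocess
import sys
from multiprocessing.pool import ThreadPool


def run_command(cmd):
    """Run one external command and return (cmd, exit code)."""
    completed = subprocess.run(
        cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
    )
    return cmd, completed.returncode


def run_all(commands, workers=32):
    """Run every command, keeping at most `workers` running at once."""
    with ThreadPool(processes=workers) as pool:
        # map() returns the results in the same order as `commands`.
        return pool.map(run_command, commands)


if __name__ == "__main__":
    # Placeholder jobs (an assumption): each one starts an interpreter
    # that exits immediately. Real jobs would be the serial C programs.
    jobs = [[sys.executable, "-c", "pass"] for _ in range(8)]
    for cmd, code in run_all(jobs, workers=4):
        print(code, " ".join(cmd))
```

This stays within one node, so it does not get past the 32-processor limit by itself.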

I've had a look at https://wiki.python.org/moin/ParallelProcessing and was wondering whether anyone has experience running multiple serial jobs on a cluster with any of the tools listed there. I did write an mpi4py code in which rank 0 did all of the job selection and then handed the jobs out to the other processors. It didn't play nicely on the machine, since I needed to use subprocess to launch a large amount of C code. So one last caveat is that any solution has to play nicely with subprocess.

I would like it to look at the number of nodes chosen and then do something along the lines of:

ccmrun jobscheduler.py &
ccmrun jobrunner.py 63 &  # given that I started the job with 64 processors

I may have to do a bash loop here, but that's no problem.

Once started, I would want them to be able to communicate with one another, but without MPI I'm not sure of an efficient way to do this. If anyone could get me started on the right path, I would greatly appreciate it. Maybe the scheduler could write pickle dumps, with a jobrunner locking each dump and deleting it when it picks it up. There might be a really simple way of doing this, but I'm very new to this.
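The pickle-dump idea could be sketched roughly as follows. This is a minimal sketch under assumptions, not a tested recipe for the Cray: submit and claim are hypothetical names, the queue is simply a directory that both scripts can see, and the "lock" is an os.rename, which is atomic on a local POSIX filesystem but may behave differently on a shared or network filesystem.

```python
import os
import pickle
import tempfile
import uuid


def submit(queue_dir, job):
    """Scheduler side: write one pickled job into the queue directory."""
    name = uuid.uuid4().hex
    tmp_path = os.path.join(queue_dir, name + ".tmp")
    with open(tmp_path, "wb") as handle:
        pickle.dump(job, handle)
    # Publish with a rename so runners never see a half-written file.
    os.rename(tmp_path, os.path.join(queue_dir, name + ".job"))


def claim(queue_dir, runner_id):
    """Runner side: claim one job, or return None if none is waiting."""
    for entry in sorted(os.listdir(queue_dir)):
        if not entry.endswith(".job"):
            continue
        src = os.path.join(queue_dir, entry)
        dst = os.path.join(queue_dir, "%s.claimed-%s" % (entry, runner_id))
        try:
            os.rename(src, dst)  # only one runner can win this rename
        except FileNotFoundError:
            continue  # another runner claimed it first
        with open(dst, "rb") as handle:
            job = pickle.load(handle)
        os.remove(dst)  # delete the dump once it has been picked up
        return job
    return None


if __name__ == "__main__":
    queue_dir = tempfile.mkdtemp()  # stand-in for a shared directory
    submit(queue_dir, {"cmd": ["echo", "hello"]})
    print(claim(queue_dir, runner_id=0))
    print(claim(queue_dir, runner_id=1))
```

A jobrunner would call claim in a loop and pass the resulting command to subprocess. How it decides that the scheduler is finished (for example a sentinel file) is left out here.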

Thanks!

1 answer

I don't know anything about Cray machines, but I'll take a stab at this anyway. You mentioned qsub, which makes me think the system is using PBS or Torque. Both seem to support job arrays, which may be along the lines of what you are looking for.

Job arrays would make the queue system responsible for job management. Each subjob would be assigned an array ID from a range you specify, along with whatever resources you requested with -l. In Torque, '#PBS -l nodes=1' and '#PBS -t 1-64' would create 64 subjobs with indexes from 1 to 64, each assigned a single node. Man pages and Google will be good resources; from what I've seen, Torque and PBS differ on syntax. If that doesn't work, you can look at using pbsdsh inside a single, larger job.
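Inside each subjob, the script has to work out which piece of work belongs to its array index. A rough sketch, assuming the index arrives in the PBS_ARRAYID environment variable (Torque) or PBS_ARRAY_INDEX (PBS Professional); which one is set should be checked against the local installation. The function names and the job list are made up for illustration.

```python
import os
import subprocess
import sys


def array_index(environ=os.environ):
    """Return this subjob's array index, or None outside a job array."""
    for name in ("PBS_ARRAYID", "PBS_ARRAY_INDEX"):  # Torque, PBS Pro
        value = environ.get(name)
        if value is not None:
            return int(value)
    return None


def select_job(jobs, index, first=1):
    """Map an array index (counted from `first`) to one entry of `jobs`."""
    position = index - first
    if not 0 <= position < len(jobs):
        raise IndexError("array index %d has no matching job" % index)
    return jobs[position]


def main():
    # Hypothetical job list: in practice it might be read from a file.
    jobs = [[sys.executable, "-c", "print(%d)" % n] for n in range(64)]
    index = array_index()
    if index is None:
        print("not running inside a job array; nothing to do")
        return 0
    # Launch the selected command and pass its exit status back.
    return subprocess.call(select_job(jobs, index))


if __name__ == "__main__":
    status = main()
    if status:
        sys.exit(status)
```

With '#PBS -t 1-64', each of the 64 subjobs would run this same script and pick a different entry, so no communication between the subjobs is needed.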

Also, I want to mention that advice from strangers on the Internet will only take you so far. Your local admin may have limits or scheduling policies in place that limit your options. You may also be able to get advice from the admin about other, proven ways to solve your problem.