
Distribution of MPI-threaded program on available GPUs, using slurm

My program consists of two parts, A and B, both written in C++. B is loaded from a separate DLL and is capable of running either on the CPU or on the GPU, depending on how it is linked. When the main program is launched, it creates one instance of A, which in turn creates one instance of B (which then works either on the locally available CPUs or on the first GPU).
When launching the program using mpirun (or via slurm, which in turn launches mpirun), one instance of A is created for each MPI rank, and each A creates one instance of B for itself. When there is only one GPU in the system, that GPU will be used, but what happens if there are multiple GPUs in the system? Are all the instances of B placed on the same GPU, even though several GPUs are available, or are they distributed evenly?
Is there any way to influence that behavior? Unfortunately my development machine does not have multiple GPUs, so I cannot test this anywhere except in production.

Slurm supports binding MPI ranks to GPUs through, for example, the --gpu-bind option: https://slurm.schedmd.com/gres.html . Assuming the cluster is correctly configured to enforce GPU affinities, this will allow you to assign one GPU per rank even when there are multiple ranks on a single node.
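
As a concrete illustration, a launch along the following lines binds each rank to its own GPU (a sketch only: the GRES name gpu, the task and GPU counts, and the single:1 binding mode are assumptions about the cluster's configuration):

    srun --ntasks-per-node=4 --gres=gpu:4 --gpu-bind=single:1 ./my_program

With single:1, each task is bound to exactly one GPU with at most one task placed per GPU; --gpu-bind=closest is another common choice, which binds each task to the GPU nearest to its CPUs.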

If you want to be able to test this, you could use, for example, the cudaGetDevice and cudaGetDeviceProperties calls to get the device luid (locally unique id) for each rank and then check that no luid is duplicated within a node.
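
A minimal sketch of such a check, assuming both CUDA and MPI are available (note that cudaDeviceProp::luid is only populated on Windows, so the PCI domain/bus/device IDs are used below as a portable per-node device identifier):

    // Each MPI rank reports the physical GPU it was assigned; if two ranks
    // on the same host print the same PCI address, they share a GPU.
    #include <cstdio>
    #include <mpi.h>
    #include <cuda_runtime.h>

    int main(int argc, char** argv) {
        MPI_Init(&argc, &argv);
        int rank = 0;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        int device = 0;
        cudaGetDevice(&device);               // device index as seen by this rank
        cudaDeviceProp prop{};
        cudaGetDeviceProperties(&prop, device);

        char host[MPI_MAX_PROCESSOR_NAME];
        int len = 0;
        MPI_Get_processor_name(host, &len);

        // With per-rank binding (e.g. via CUDA_VISIBLE_DEVICES) every rank may
        // see "device 0", so the PCI address is what distinguishes physical GPUs.
        std::printf("rank %d on %s: device %d, PCI %04x:%02x:%02x\n",
                    rank, host, device,
                    prop.pciDomainID, prop.pciBusID, prop.pciDeviceID);

        MPI_Finalize();
        return 0;
    }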
