
Concurrent GPU kernel execution from multiple processes

I have an application in which I would like to share a single GPU between multiple processes. That is, each of these processes would create its own CUDA or OpenCL context targeting the same GPU. According to the Fermi white paper [1], application-level context switching takes less than 25 microseconds, but launches from different contexts are effectively serialized on the GPU, so Fermi would not work well for this. According to the Kepler white paper [2], there is a feature called Hyper-Q that allows up to 32 simultaneous connections from multiple CUDA streams, MPI processes, or threads within a process.

My questions: Has anyone tried this on a Kepler GPU and verified that kernels scheduled from distinct processes run concurrently? Is this a CUDA-only feature, or can it also be used with OpenCL on Nvidia GPUs? Do AMD's GPUs support something similar?

[1] http://www.nvidia.com/content/PDF/fermi_white_papers/NVIDIA_Fermi_Compute_Architecture_Whitepaper.pdf

[2] http://www.nvidia.com/content/PDF/kepler/NVIDIA-Kepler-GK110-Architecture-Whitepaper.pdf

In response to the first question, NVIDIA has published some Hyper-Q results in a blog post. The post points out that the developers porting CP2K were able to reach accelerated results more quickly because Hyper-Q allowed them to keep the application's MPI structure more or less as-is and run multiple ranks on a single GPU, achieving higher effective GPU utilization that way. As mentioned in the comments, this Hyper-Q feature is currently only available on K20 processors, as it depends on the GK110 GPU.
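For the multi-process case described above, the mechanism that exposes Hyper-Q across MPI ranks is the CUDA Multi-Process Service (MPS): a proxy server funnels the CUDA work of several processes into one GPU context so it can use Hyper-Q's hardware queues. A minimal sketch on Linux (the application name is a placeholder, and daemon behavior can vary across driver versions):

```shell
# Start the MPS control daemon for GPU 0.
export CUDA_VISIBLE_DEVICES=0
nvidia-cuda-mps-control -d

# Run the MPI application; each rank's CUDA work is routed through
# the shared MPS server, so kernels from different ranks can occupy
# Hyper-Q's hardware queues concurrently.
mpirun -np 4 ./my_gpu_app    # my_gpu_app is a hypothetical binary

# Shut the daemon down when finished.
echo quit | nvidia-cuda-mps-control
```

Without MPS, each process gets its own GPU context and launches are time-sliced between contexts rather than run concurrently.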

I've run simultaneous kernels on the Fermi architecture; it works wonderfully and, in fact, is often the only way to get high occupancy from your hardware. I used OpenCL, and you need to run a separate command queue from a separate CPU thread in order to do this. Note that Hyper-Q itself is a Kepler-only hardware feature (up to 32 work queues between host and GPU); the ability to dispatch new data-parallel kernels from within another kernel is a separate Kepler feature, dynamic parallelism.
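The same pattern in CUDA terms: independent kernels launched into different non-default streams may overlap on Fermi and later, when resources allow. A minimal sketch, with an illustrative kernel and sizes (error checking omitted for brevity):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// A deliberately long-running kernel so overlap is visible in a profiler.
__global__ void busy_kernel(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float v = data[i];
        for (int k = 0; k < 10000; ++k)
            v = v * 1.0001f + 0.0001f;
        data[i] = v;
    }
}

int main() {
    const int n = 1 << 16;
    const int num_streams = 4;
    cudaStream_t streams[num_streams];
    float *buf[num_streams];

    // Launch one kernel per stream; with no dependencies between
    // streams, the hardware is free to run them concurrently.
    for (int s = 0; s < num_streams; ++s) {
        cudaStreamCreate(&streams[s]);
        cudaMalloc(&buf[s], n * sizeof(float));
        busy_kernel<<<(n + 255) / 256, 256, 0, streams[s]>>>(buf[s], n);
    }

    for (int s = 0; s < num_streams; ++s) {
        cudaStreamSynchronize(streams[s]);
        cudaFree(buf[s]);
        cudaStreamDestroy(streams[s]);
    }
    printf("done\n");
    return 0;
}
```

This demonstrates concurrency within one process; concurrency across processes is exactly what requires Hyper-Q (via MPS) on the NVIDIA side.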
