
Multiple host threads launching individual CUDA kernels

For my CUDA development, I am using a machine with 16 cores and one GTX 580 GPU with 16 SMs. For the work that I am doing, I plan to launch 16 host threads (one on each core), with one kernel launch per thread, each using 1 block of 1024 threads. My goal is to run 16 kernels in parallel on the 16 SMs. Is this possible/feasible?

I have tried to read as much as possible about independent contexts, but there does not seem to be much information available. As I understand it, each host thread can have its own GPU context. However, I am not sure whether the kernels will run in parallel if I use independent contexts.

I could read all the data from all 16 host threads into one giant structure and pass it to the GPU to launch a single kernel. However, that would involve too much copying and would slow down the application.

You can only have one context active on a GPU at a time. One way to achieve the sort of parallelism you require is to use CUDA streams. You can create 16 streams inside the context and launch memcopies and kernels into specific streams by name. There is a quick webinar on using streams at http://developer.download.nvidia.com/CUDA/training/StreamsAndConcurrencyWebinar.pdf . The full API reference is in the CUDA toolkit manuals; the CUDA 4.2 manual is available at http://developer.download.nvidia.com/compute/DevZone/docs/html/C/doc/CUDA_Toolkit_Reference_Manual.pdf .
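A minimal sketch of the streams approach might look like the following. The kernel `work` and its data layout are placeholders standing in for the asker's per-thread workload, not anything from the question; on a Fermi-class GPU such as the GTX 580, independent kernels launched into different streams may execute concurrently, subject to resource limits.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Placeholder kernel standing in for the real per-thread work.
__global__ void work(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] *= 2.0f;
}

int main()
{
    const int nStreams = 16;
    const int n = 1024;

    cudaStream_t streams[nStreams];
    float *d_data[nStreams];

    // One stream and one device buffer per logical "task".
    for (int s = 0; s < nStreams; ++s) {
        cudaStreamCreate(&streams[s]);
        cudaMalloc(&d_data[s], n * sizeof(float));
    }

    // Launch 1 block of 1024 threads into each stream. Launches in
    // different streams are independent and may overlap on the device.
    for (int s = 0; s < nStreams; ++s)
        work<<<1, 1024, 0, streams[s]>>>(d_data[s], n);

    cudaDeviceSynchronize();

    for (int s = 0; s < nStreams; ++s) {
        cudaStreamDestroy(streams[s]);
        cudaFree(d_data[s]);
    }
    return 0;
}
```

Asynchronous memcopies (`cudaMemcpyAsync` with pinned host memory) can be issued into the same streams so that transfers for one task overlap with kernels from another.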

While a multi-threaded application can hold multiple CUDA contexts simultaneously on the same GPU, those contexts cannot perform operations concurrently. When active, each context has sole use of the GPU and must yield before another context (which could include operations from a rendering API or a display manager) can access the GPU.

So, in a word, no: this strategy cannot work with any current CUDA version or hardware.

