
CUDA optimization, multiprocessors, concurrent kernel execution


I have a few questions (I spent quite some time just trying to find the answers):

  1. Where can I find information about the maximum number of blocks per streaming multiprocessor on my device? (I know it might be 16 blocks, but I cannot confirm it.) I need to read it in code, something like myDevice.maxBlocksPerMultiProcessor.

  2. Will a default kernel launch (e.g. <<<blocks, threads>>> on the default stream 0) spread computations evenly among all multiprocessors? (Or will only one multiprocessor do the work?)
    I understand that this depends on my grid configuration, and I am not asking about that. Let's just assume I have a "performance friendly" grid (I mean block/thread dimensions chosen relative to maxThreadsPerMultiProcessor so as to maximize multiprocessor occupancy).
    Will it launch on multiple multiprocessors by default?

  3. Let's say my GPU supports 16 blocks per multiprocessor and 2048 maxThreadsPerMultiProcessor. Then I would like to launch my kernel with <<< N*16, 128 >>> to maximize multiprocessor occupancy. Can I improve performance using streams and/or concurrent kernel execution?
    (I do not think so, because I cannot get more than 100% multiprocessor occupancy. *I know it sounds absurd, but my English is not perfect.*)

Sorry for my bad English!
Thank you for your help!

  1. Where can I find information about the maximum number of blocks per streaming multiprocessor, on my device?

    You can get this information from the programming guide here. You'll want to know the compute capability of your device; you can look that up here. Your device's compute capability can also be retrieved programmatically; study the deviceQuery CUDA sample code for an example. If you need the max blocks per multiprocessor programmatically, you will need to incorporate a version of the table from the programming guide linked above into your program, then use the compute capability to determine the value at runtime.
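As an aside, newer CUDA toolkits (11.0 and later) expose this limit directly as the device attribute cudaDevAttrMaxBlocksPerMultiprocessor, so on those versions the table lookup can be skipped. A minimal sketch:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int device = 0;
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, device);
    printf("Compute capability: %d.%d\n", prop.major, prop.minor);
    printf("Max threads per SM: %d\n", prop.maxThreadsPerMultiProcessor);

    // CUDA 11.0+ only: query the max resident blocks per SM directly,
    // instead of looking it up in the programming guide's table.
    int maxBlocksPerSM = 0;
    cudaDeviceGetAttribute(&maxBlocksPerSM,
                           cudaDevAttrMaxBlocksPerMultiprocessor, device);
    printf("Max blocks per SM:  %d\n", maxBlocksPerSM);
    return 0;
}
```

On toolkits older than 11.0, the attribute is not available and the table-plus-compute-capability approach described above is the way to go.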

  2. Will a default kernel launch (e.g. <<<blocks, threads>>> on the default stream 0) spread computations evenly among all multiprocessors?

    Yes, this is a fundamental part of the CUDA programming model. As long as you have launched enough blocks to place at least one on each SM, the GPU work distributor will distribute blocks as evenly as it can.
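If you want to observe this distribution yourself, one informal trick is to have each block record which SM it ran on by reading the %smid special register via inline PTX (the SM id is not guaranteed to be stable or meaningful across launches, so treat this purely as a diagnostic sketch):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Read the id of the SM the calling thread is executing on.
__device__ unsigned smid() {
    unsigned id;
    asm("mov.u32 %0, %%smid;" : "=r"(id));
    return id;
}

__global__ void recordSM(unsigned *out) {
    if (threadIdx.x == 0)          // one record per block is enough
        out[blockIdx.x] = smid();
}

int main() {
    const int blocks = 64;
    unsigned *d_out, h_out[blocks];
    cudaMalloc(&d_out, blocks * sizeof(unsigned));
    recordSM<<<blocks, 128>>>(d_out);
    cudaMemcpy(h_out, d_out, blocks * sizeof(unsigned),
               cudaMemcpyDeviceToHost);
    for (int i = 0; i < blocks; ++i)
        printf("block %2d ran on SM %u\n", i, h_out[i]);
    cudaFree(d_out);
    return 0;
}
```

With enough blocks, you should see the block indices spread across all the SM ids on your device rather than piling up on one.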

  3. Yes, a kernel launch of <<<N, 128>>>, where N is sufficiently large, should be an enabling factor in achieving maximum occupancy. Occupancy can have various other limiters (e.g. registers, shared memory usage, etc.), so this does not guarantee anything, but it should allow for maximum occupancy (2048 threads per SM) in your example. Regarding streams (I think you really mean to ask about concurrent kernels), it's generally true that once you have exposed enough parallelism to saturate a particular GPU, exposing more parallelism may not provide any additional benefit. However, it may provide benefit on a future GPU, and furthermore streams allow for things other than just concurrent kernels. Streams allow for overlap of copy and compute, which may be another valuable factor in improving overall performance.
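The copy/compute overlap mentioned above is typically done by splitting the data into chunks and pipelining each chunk's host-to-device copy, kernel, and device-to-host copy on its own stream. A sketch of the pattern, using a made-up scale kernel for illustration (asynchronous copies require pinned host memory):

```cuda
#include <cuda_runtime.h>

// Illustrative kernel: double every element of a chunk.
__global__ void scale(float *x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= 2.0f;
}

int main() {
    const int N = 1 << 20, chunks = 4, chunk = N / chunks;
    float *h, *d;
    cudaMallocHost(&h, N * sizeof(float));  // pinned memory: required for true async copies
    cudaMalloc(&d, N * sizeof(float));

    cudaStream_t s[chunks];
    for (int i = 0; i < chunks; ++i) cudaStreamCreate(&s[i]);

    // Pipeline: while chunk i is computing, chunk i+1's
    // host-to-device copy can already be in flight.
    for (int i = 0; i < chunks; ++i) {
        float *hp = h + i * chunk, *dp = d + i * chunk;
        cudaMemcpyAsync(dp, hp, chunk * sizeof(float),
                        cudaMemcpyHostToDevice, s[i]);
        scale<<<(chunk + 127) / 128, 128, 0, s[i]>>>(dp, chunk);
        cudaMemcpyAsync(hp, dp, chunk * sizeof(float),
                        cudaMemcpyDeviceToHost, s[i]);
    }
    cudaDeviceSynchronize();

    for (int i = 0; i < chunks; ++i) cudaStreamDestroy(s[i]);
    cudaFreeHost(h); cudaFree(d);
    return 0;
}
```

On devices with separate copy and compute engines, this overlaps transfers with kernel execution even when a single kernel already saturates the SMs.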

Many of these topics are covered in programming guide sections 2-5 on the CUDA programming model, hardware implementation, and performance guidelines. The CUDA best practices guide also covers useful related information.
