Cuda Stream Processing for multiple kernels Disambiguation

Hi, a few questions regarding CUDA stream processing for multiple kernels. Assume s streams and a kernel on a compute capability 3.5 Kepler device, where s <= 32. The kernel uses a dev_input array of size n and a dev_output array of size s*n. The kernel reads data from the input array, stores the value in a register, manipulates it, and writes its result back to dev_output at the offset corresponding to its stream (i.e., at stream_index*n + tid).

We aim to run the same kernel s times, using a different one of the s streams each time, similar to the simpleHyperQ example (a minimal sketch of this launch pattern follows the list below). Can you comment on if and how any of the following affects concurrency?

  1. dev_input and dev_output are not pinned;
  2. dev_input as it is (size n, shared by all launches) vs. dev_input of size s*n, where each kernel reads unique data (no read conflicts);
  3. kernels read data from constant memory;
  4. 10 KB of shared memory is allocated per block;
  5. the kernel uses 60 registers per thread.
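For reference, here is a minimal sketch of the kind of launch pattern described above; the kernel body, names, and sizes are illustrative placeholders (similar in spirit to the simpleHyperQ example):

```
// Sketch: launch the same kernel once per stream, each writing its own slice of dev_output.
#include <cuda_runtime.h>

__global__ void myKernel(const float *dev_input, float *dev_output, int n)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < n) {
        float v = dev_input[tid];   // read input into a register
        v = v * 2.0f + 1.0f;        // placeholder manipulation
        dev_output[tid] = v;        // write into this stream's slice (offset applied by pointer below)
    }
}

int main()
{
    const int s = 8, n = 1 << 20;   // stream count and element count (placeholders)
    float *dev_input, *dev_output;
    cudaMalloc(&dev_input,  n * sizeof(float));
    cudaMalloc(&dev_output, s * n * sizeof(float));

    cudaStream_t streams[32];
    for (int i = 0; i < s; ++i) cudaStreamCreate(&streams[i]);

    dim3 block(256), grid((n + block.x - 1) / block.x);
    for (int i = 0; i < s; ++i)
        myKernel<<<grid, block, 0, streams[i]>>>(dev_input, dev_output + i * n, n);

    cudaDeviceSynchronize();
    for (int i = 0; i < s; ++i) cudaStreamDestroy(streams[i]);
    cudaFree(dev_input); cudaFree(dev_output);
    return 0;
}
```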

Any good comments will be appreciated!

Cheers, Thanasio

Robert, thanks a lot for your detailed answer; it has been very helpful. I edited point 4: it is 10 KB per block. In my situation I launch grids of 61 blocks of 256 threads, and the kernels are rather compute-bound. I launch 8 streams of the same kernel. When I profile them I see a very good overlap between the first two, and then it gets worse and worse. The kernel execution time is around 6 ms. After the first two streams execute almost perfectly concurrently, the remaining launches are spaced about 3 ms apart. Regarding point 5, I use a K20, which supports up to 255 registers per thread, so I would not expect a limitation there. I really cannot understand why I do not achieve concurrency close to what is specified for GK110.

Please take a look at the following link. There is an image called kF.png; it shows the profiler output for the streams.

https://devtalk.nvidia.com/default/topic/531740/cuda-programming-and-performance/concurrent-streams-and-hyperq-for-k20/

Concurrency amongst kernels depends upon a number of factors, but one that many people overlook is simply the size of the kernel (i.e., the number of blocks in the grid). Kernels that are of a size that can effectively utilize the GPU by themselves will not generally run concurrently to a large degree, and there would be little throughput advantage even if they did. The work distributor inside the GPU will generally begin distributing blocks as soon as a kernel is launched, so if one kernel is launched before another, and both have a large number of blocks, then the first kernel will generally occupy the GPU until it is nearly complete, at which point blocks of the second kernel will get scheduled and executed, perhaps with a small amount of "concurrent overlap".

The main point is that kernels that have enough blocks to "fill up the GPU" will prevent other kernels from actually executing, and apart from scheduling, this isn't any different on a compute 3.5 device. In addition, rather than just specifying a few parameters for the kernel as a whole, specifying launch parameters and statistics (such as register usage, shared memory usage, etc.) at the block level is helpful for providing crisp answers. The benefits of the compute 3.5 architecture in this area will still mainly come from "small" kernels of "few" blocks attempting to execute together; compute 3.5 has some advantages there.

You should also review the answer to this question.

  1. When global memory used by the kernel is not pinned, it affects the speed of data transfer and also the ability to overlap copy and compute, but it does not affect the ability of two kernels to execute concurrently. Nevertheless, the limitation on copy and compute overlap may skew the behavior of your application. (A short sketch of the pinned-copy pattern appears below, after this list and the summary paragraph.)
  2. There shouldn't be "read conflicts"; I'm not sure what you mean by that. Two independent threads/blocks/grids are allowed to read the same location in global memory. Generally this will get sorted out at the L2 cache level. As long as we are talking about reads only, there should be no conflict, and no particular effect on concurrency.
  3. Constant memory is a limited resource, shared amongst all kernels executing on the device (try running deviceQuery). If you have not exceeded the total device limit, then the only issue will be one of utilization of the constant cache, and things like cache thrashing. Apart from this secondary relationship, there is no direct effect on concurrency.
  4. It would be more instructive to identify the amount of shared memory per block rather than per kernel. This will directly affect how many blocks can be scheduled on an SM. Answering this question would also be much crisper if you specified the launch configuration of each kernel, as well as the relative timing of the launch invocations. If shared memory happened to be the limiting factor in scheduling, then you can divide the total available shared memory per SM by the amount used by each kernel, to get an idea of the possible concurrency based on this. My own opinion is that the number of blocks in each grid is likely to be a bigger issue, unless you have kernels that use 10 KB per grid but only have a few blocks in the whole grid.
  5. My comments here would be nearly the same as my response to 4. Take a look at deviceQuery for your device, and if registers became a limiting factor in scheduling blocks on each SM, then you could divide the available registers per SM by the register usage per kernel (again, it makes a lot more sense to talk about register usage per block and the number of blocks in the kernel) to discover what the limit might be. (A short sketch of this kind of calculation follows the list.)
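To make the arithmetic in points 4 and 5 concrete, here is a rough sketch of how such an estimate could be computed from the device properties. The per-block usage figures are placeholders, to be replaced with the actual numbers reported by the compiler (-Xptxas -v) or the profiler; on Kepler the per-block resource limits reported by the runtime coincide with the per-SM amounts, which this sketch assumes:

```
// Rough sketch: estimate how many blocks per SM the shared-memory and register budgets allow.
#include <cuda_runtime.h>
#include <cstdio>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    // Placeholder per-block usage (substitute your measured values):
    const size_t smemPerBlock    = 10 * 1024;   // 10 KB shared memory per block
    const int    regsPerThread   = 60;
    const int    threadsPerBlock = 256;
    const int    regsPerBlock    = regsPerThread * threadsPerBlock;

    int blocksBySmem = (int)(prop.sharedMemPerBlock / smemPerBlock);  // shared-memory-limited blocks/SM
    int blocksByRegs = prop.regsPerBlock / regsPerBlock;              // register-limited blocks/SM
    int blocksPerSM  = blocksBySmem < blocksByRegs ? blocksBySmem : blocksByRegs;

    printf("SMs: %d\n", prop.multiProcessorCount);
    printf("Blocks/SM limited by shared memory: %d\n", blocksBySmem);
    printf("Blocks/SM limited by registers:     %d\n", blocksByRegs);
    printf("Device-wide block estimate:         %d\n", blocksPerSM * prop.multiProcessorCount);
    return 0;
}
```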

Again, if you have reasonably sized kernels (hundreds or thousands of blocks, or more), then the scheduling of blocks by the work distributor is most likely going to be the dominant factor in the amount of concurrency between kernels.
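Returning to point 1: for copy/compute overlap, the host side of the transfer must be page-locked. A minimal sketch of that pattern, with illustrative names and sizes, could look like this:

```
// Sketch: pinned host buffer + asynchronous copy in a stream, so the copy
// can overlap with kernels running in other streams.
#include <cuda_runtime.h>

int main()
{
    const int n = 1 << 20;
    float *h_input, *d_input;

    cudaMallocHost(&h_input, n * sizeof(float));   // pinned (page-locked) host memory
    cudaMalloc(&d_input, n * sizeof(float));

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // Asynchronous H2D copy: returns immediately and can overlap with work
    // issued to other streams, because the host buffer is pinned.
    cudaMemcpyAsync(d_input, h_input, n * sizeof(float),
                    cudaMemcpyHostToDevice, stream);

    cudaStreamSynchronize(stream);

    cudaStreamDestroy(stream);
    cudaFree(d_input);
    cudaFreeHost(h_input);
    return 0;
}
```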

EDIT: in response to the new information posted in the question. I've looked at the kF.png image.

  1. First let's analyze from a blocks-per-SM perspective. CC 3.5 allows 16 "open" or currently scheduled blocks per SM. If you are launching 2 kernels of 61 blocks each, that may well be enough to fill the "ready-to-go" queue on the CC 3.5 device. Stated another way, the GPU can handle 2 of these kernels at a time. As the blocks of one of those kernels "drain", another kernel is scheduled by the work distributor. The blocks of the first kernel "drain" sufficiently in about half the total time, so that the next kernel gets scheduled about halfway through the completion of the first 2 kernels; so at any given point (draw a vertical line on the timeline) you have either 2 or 3 kernels executing simultaneously. (The 3rd kernel launched overlaps the first 2 by about 50% according to the graph; I don't agree with your statement that there is a 3 ms distance between each successive kernel launch.) If we say that at peak we have 3 kernels scheduled (there are plenty of vertical lines that will intersect 3 kernel timelines) and each kernel has ~60 blocks, then that is about 180 blocks. Your K20 has 13 SMs, and each SM can have at most 16 blocks scheduled on it. This means at peak you have about 180 blocks scheduled (perhaps) vs. a theoretical peak of 16*13 = 208. So you're pretty close to the max here, and there's not much more that you could possibly get. But maybe you think you're only getting 120/208, I don't know.
  2. Now let's take a look from a shared memory perspective. A key question is: what is the setting of your L1/shared split? I believe it defaults to 48 KB of shared memory per SM, but if you've changed this setting it will be pretty important (a minimal sketch of how the split is selected follows this list). Regardless, according to your statement each block scheduled will use 10 KB of shared memory. This means we would max out at around 4 blocks scheduled per SM, or 4*13 = 52 total blocks max that can be scheduled at any given time. You're clearly exceeding this number, so probably I don't have enough information about the shared memory usage by your kernels. If you're really using 10 KB/block, this would more or less preclude you from having more than one kernel's worth of threadblocks executing at a time. There could still be some overlap, and I believe this is likely to be the actual limiting factor in your application. The first kernel of 60 blocks gets scheduled. After a few blocks drain (or perhaps because the 2 kernels were launched close enough together), the second kernel begins to get scheduled, nearly simultaneously. Then we have to wait a while for about a kernel's worth of blocks to drain before the 3rd kernel can get scheduled; this may well be at the 50% point as indicated in the timeline.
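For reference, the L1/shared split mentioned in point 2 is selected with the runtime's cache-config calls; a minimal sketch (the kernel here is a placeholder):

```
// Sketch: selecting the L1/shared-memory split on a Kepler device.
#include <cuda_runtime.h>

__global__ void myKernel() { }

int main()
{
    // Prefer 48 KB shared / 16 KB L1 for the whole device...
    cudaDeviceSetCacheConfig(cudaFuncCachePreferShared);

    // ...or override the preference for a specific kernel.
    cudaFuncSetCacheConfig(myKernel, cudaFuncCachePreferL1);

    myKernel<<<1, 1>>>();
    cudaDeviceSynchronize();
    return 0;
}
```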

Anyway, I think analyses 1 and 2 above clearly suggest you're getting most of the capability out of the device, based on the limitations inherent in your kernel structure. (We could do a similar analysis based on registers to discover whether that is a significant limiting factor.) Regarding this statement: "I really cannot understand why I do not achieve concurrency close to what is specified for GK110." I hope you can see that the concurrency spec (e.g., 32 kernels) is a maximum spec, and in most cases you are going to run into some other kind of machine limit before you hit the limit on the maximum number of kernels that can execute simultaneously.

EDIT: regarding documentation and resources, the answer from Greg Smith that I linked to above provides some resource links. Here are a few more:

  • The CUDA C Programming Guide has a section on Asynchronous Concurrent Execution.
  • The GPU Concurrency and Streams presentation by Dr. Steve Rennich at NVIDIA is on the NVIDIA webinar page.

My experience with HyperQ so far is a 2-3x (up to 3.5x) parallelization of my kernels, as the kernels are usually larger for slightly more complex calculations. With small kernels it's a different story, but usually the kernels are more complicated.

This is also addressed by Nvidia in their CUDA 5.0 documentation: more complex kernels will reduce the amount of parallelization.

But still, GK110 has a great advantage in allowing this at all.
