CUDA Stream Processing for Multiple Kernels: Disambiguation
Hi, a few questions regarding CUDA stream processing for multiple kernels. Assume s streams and a kernel on a compute capability 3.5 Kepler device, where s <= 32. The kernel uses a dev_input array of size n and a dev_output array of size s*n. The kernel reads data from the input array, stores its value in a register, manipulates it, and writes its result back to dev_output at position s*n + tid.
We aim to run the same kernel s times, using one of the s streams each time, similar to the simpleHyperQ example. Can you comment on if and how any of the following affects concurrency? Any good comments will be appreciated!

cheers, Thanasio
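A minimal sketch of the launch pattern described above, one launch of the same kernel per stream. The kernel name `myKernel` and the body of the per-element manipulation are assumptions, not the original code; only the array layout (input of size n, output slice at s*n + tid) follows the question:

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel matching the description: read one input element into a
// register, manipulate it, and write it to this launch's slice of dev_output.
__global__ void myKernel(const float *dev_input, float *dev_output,
                         int n, int sliceIdx)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < n) {
        float v = dev_input[tid];            // read into a register
        v = v * 2.0f + 1.0f;                 // placeholder manipulation
        dev_output[sliceIdx * n + tid] = v;  // write to slice sliceIdx
    }
}

int main(void)
{
    const int s = 8;              // number of streams (from the discussion)
    const int n = 61 * 256;       // 61 blocks of 256 threads
    float *dev_input, *dev_output;
    cudaMalloc(&dev_input,  n * sizeof(float));
    cudaMalloc(&dev_output, s * n * sizeof(float));

    cudaStream_t streams[s];
    for (int i = 0; i < s; ++i)
        cudaStreamCreate(&streams[i]);

    // Launch the same kernel s times, once per stream.
    dim3 block(256), grid((n + block.x - 1) / block.x);
    for (int i = 0; i < s; ++i)
        myKernel<<<grid, block, 0, streams[i]>>>(dev_input, dev_output, n, i);

    cudaDeviceSynchronize();
    for (int i = 0; i < s; ++i)
        cudaStreamDestroy(streams[i]);
    cudaFree(dev_input);
    cudaFree(dev_output);
    return 0;
}
```

Because each launch writes to a disjoint slice of dev_output, the launches are independent and are candidates for concurrent execution via HyperQ.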
Robert, thanks a lot for your detailed answer; it has been very helpful. I edited 4: it is 10 KB per block. So in my situation I launch grids of 61 blocks of 256 threads. The kernels are rather compute-bound. I launch 8 streams of the same kernel, profile them, and then I see a very good overlap between the first two, after which it gets worse and worse. The kernel execution time is around 6 ms. After the first two streams execute almost perfectly concurrently, the rest have a 3 ms gap between them. Regarding 5, I use a K20, which allows up to 255 registers per thread, so I would not expect drawbacks from there. I really cannot understand why I do not achieve concurrency equivalent to what is specified for GK110s.
Please take a look at the following link. There is an image called kF.png; it shows the profiler output for the streams:

https://devtalk.nvidia.com/default/topic/531740/cuda-programming-and-performance/concurrent-streams-and-hyperq-for-k20/
Concurrency amongst kernels depends upon a number of factors, but one that many people overlook is simply the size of the kernel (i.e., the number of blocks in the grid). Kernels of a size that can effectively utilize the GPU by themselves will not generally run concurrently to a large degree, and there would be little throughput advantage even if they did. The work distributor inside the GPU will generally begin distributing blocks as soon as a kernel is launched, so if one kernel is launched before another, and both have a large number of blocks, then the first kernel will generally occupy the GPU until it is nearly complete, at which point blocks of the second kernel will get scheduled and executed, perhaps with a small amount of "concurrent overlap".
The main point is that kernels that have enough blocks to "fill up the GPU" will prevent other kernels from actually executing, and apart from scheduling, this isn't any different on a compute 3.5 device. In addition, rather than just specifying a few parameters for the kernel as a whole, also specifying launch parameters and statistics (such as register usage, shared memory usage, etc.) at the block level is helpful for providing crisp answers. The benefits of the compute 3.5 architecture in this area will still mainly come from "small" kernels of "few" blocks attempting to execute together; compute 3.5 has some advantages there.
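One way to gather the per-block statistics mentioned above, besides compiling with `--ptxas-options=-v`, is to query them at runtime with `cudaFuncGetAttributes`. A sketch; `myKernel` is a placeholder name standing in for your actual kernel:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void myKernel(void) { }  // placeholder kernel

int main(void)
{
    cudaFuncAttributes attr;
    cudaFuncGetAttributes(&attr, myKernel);
    printf("registers per thread : %d\n",        attr.numRegs);
    printf("static shared memory : %zu bytes\n", attr.sharedSizeBytes);
    printf("local memory         : %zu bytes\n", attr.localSizeBytes);
    printf("max threads per block: %d\n",        attr.maxThreadsPerBlock);
    return 0;
}
```

Multiplying the per-thread register count by the block size, and comparing against the SM's register file and shared memory capacity, tells you how many blocks can be resident per SM, which is exactly the block-level information that makes concurrency questions answerable.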
You should also review the answer to this question. Again, if you have reasonably sized kernels (hundreds or thousands of blocks, or more), then the scheduling of blocks by the work distributor is most likely going to be the dominant factor in the amount of concurrency between kernels.
EDIT: in response to new information posted in the question. I've looked at the kF.png. Anyway, I think analyses 1 and 2 above clearly suggest you're getting most of the capability out of the device, based on the limitations inherent in your kernel structure. (We could do a similar analysis based on registers to discover if that is a significant limiting factor.) Regarding this statement: "I really cannot understand why i do not achieve concurrency equivalent to what is specified for gk110s.." I hope you can see that the concurrency spec (e.g., 32 kernels) is a maximum, and in most cases you are going to run into some other kind of machine limit before you hit the limit on the maximum number of kernels that can execute simultaneously.
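One way to check which machine limit bites first is the occupancy API (available in CUDA 6.5 and later; for the CUDA 5.0 era the same arithmetic can be done by hand from the `-v` ptxas output). A sketch, with `myKernel` again a placeholder:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void myKernel(void) { }  // placeholder kernel

int main(void)
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    // How many blocks of 256 threads can be resident per SM for this kernel?
    int blocksPerSM = 0;
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSM, myKernel,
                                                  256, 0 /* dynamic smem */);

    // If blocksPerSM * multiProcessorCount is not much larger than the blocks
    // in one grid, a single launch nearly fills the device, and blocks from
    // later streams must wait for residency regardless of the 32-kernel spec.
    printf("SMs: %d, resident blocks/SM: %d, device capacity: %d blocks\n",
           prop.multiProcessorCount, blocksPerSM,
           prop.multiProcessorCount * blocksPerSM);
    return 0;
}
```

For example, a K20 with 13 SMs and, say, 4 resident blocks per SM has room for 52 blocks at a time, so a 61-block grid already saturates the device by itself.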
EDIT: regarding documentation and resources, the answer I linked to above from Greg Smith provides some resource links. Here are a few more:
My experience with HyperQ so far is a 2-3x (on compute 3.5) parallelization of my kernels, as the kernels are usually larger for slightly more complex calculations. With small kernels it's a different story, but usually the kernels are more complicated. This is also stated by NVIDIA in their CUDA 5.0 documentation: more complex kernels will reduce the amount of parallelization. But still, GK110 has a great advantage just by allowing this.