Parallel Reduction
I have read the article Optimizing Parallel Reduction in CUDA by Mark Harris and found it very useful, but there are still one or two concepts I am unable to understand. On page 18 it says:
//First add during load
// each thread loads one element from global to shared mem
unsigned int tid = threadIdx.x;
unsigned int i = blockIdx.x*blockDim.x + threadIdx.x;
sdata[tid] = g_idata[i];
__syncthreads();
Optimized code, with two loads and the first add of the reduction:
// perform first level of reduction,
// reading from global memory, writing to shared memory
unsigned int tid = threadIdx.x;                            // line 1
unsigned int i = blockIdx.x*(blockDim.x*2) + threadIdx.x;  // line 2
sdata[tid] = g_idata[i] + g_idata[i+blockDim.x];           // line 3
__syncthreads();                                           // line 4
I am unable to understand line 2: if I have 256 elements and choose 128 as my block size, why am I multiplying it by 2? Also, please explain how to determine the block size.
Basically, it is performing the operation shown in the picture below:
This code is basically saying that only half as many threads do the reading from global memory and the writing to shared memory, with each thread loading two elements and storing their sum, as shown in the picture.
Say you launch a kernel and want to reduce some values. Because each thread now loads two elements, you restrict the code above to half of the threads you would otherwise run. Imagine you have 4 blocks of 512 threads each: you limit the code above to be executed by only the first two blocks, and you have a g_idata[4*512]:
unsigned int i = blockIdx.x*(blockDim.x*2) + threadIdx.x;
sdata[tid] = g_idata[i] + g_idata[i+blockDim.x];
So:
thread 0 of block 0 reads positions 0 and 512;
thread 1 of block 0 reads positions 1 and 513;
thread 511 of block 0 reads positions 511 and 1023;
thread 0 of block 1 reads positions 1024 and 1536;
thread 511 of block 1 reads positions 1535 and 2047.
The blockDim.x*2 is used because each thread accesses positions i and i+blockDim.x, so you need to multiply by 2 to guarantee that the threads of the next block do not process positions of g_idata that have already been processed.
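The effect of the factor of 2 can be checked on the host. The following is a minimal sketch in plain C++ (not a kernel) that emulates the index arithmetic for the 2-block, 512-thread example above. The names count_reads, each_read_once and stride_factor are hypothetical, and the bounds checks exist only in this emulation; the snippet quoted from the slides has none.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Host-side emulation (plain C++, not a kernel) of the index arithmetic above.
// For every emulated thread, record the two g_idata positions it would read.
// stride_factor = 2 reproduces blockIdx.x*(blockDim.x*2); 1 drops the "*2".
std::vector<int> count_reads(std::size_t n, std::size_t num_blocks,
                             std::size_t block_dim, std::size_t stride_factor) {
    assert(block_dim > 0);
    std::vector<int> reads(n, 0);
    for (std::size_t block_idx = 0; block_idx < num_blocks; ++block_idx) {
        for (std::size_t tid = 0; tid < block_dim; ++tid) {
            std::size_t i = block_idx * (block_dim * stride_factor) + tid;
            // Bounds checks are an addition for the emulation only.
            if (i < n) ++reads[i];                          // g_idata[i]
            if (i + block_dim < n) ++reads[i + block_dim];  // g_idata[i+blockDim.x]
        }
    }
    return reads;
}

// True when every position was read exactly once.
bool each_read_once(const std::vector<int>& reads) {
    for (int r : reads) {
        if (r != 1) return false;
    }
    return true;
}
```

Calling each_read_once(count_reads(2048, 2, 512, 2)) should report full coverage, while the same call with stride_factor set to 1 should not: block 1 would then start at position 512, re-reading 512 to 1023 and never reaching 1536 to 2047.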
In the optimized code you run the kernel with blocks half as large as in the non-optimized implementation.
Let's call the size of the block in the non-optimized code work, let half of this size be called unit, and let these sizes keep the same numerical values for the optimized code as well.
In the non-optimized code you run the kernel with as many threads per block as work, that is, blockDim = 2 * unit. The code in each block just copies part of g_idata to an array in shared memory of size 2 * unit.
In the optimized code blockDim = unit, so there are now half as many threads, and the array in shared memory is half the size. In line 3 the first summand comes from even units, while the second comes from odd units. In this way all the data required for the reduction is taken into account.
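To see that no data is lost, the load step can be emulated on the host. This is a minimal sketch in plain C++, assuming the input length is an exact multiple of 2 * blockDim (an assumption of this example, not something stated in the slides). The names first_add_during_load and total are hypothetical. It fills one sdata array per block exactly as line 3 does, so the sum over all sdata arrays can be compared with the sum of the input.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Host-side emulation of line 3 for every block: returns one sdata array
// per block, each holding blockDim partial sums.
std::vector<std::vector<int>> first_add_during_load(
        const std::vector<int>& g_idata, std::size_t block_dim) {
    // Assumption of this sketch: the input fills whole pairs of units.
    assert(block_dim > 0 && g_idata.size() % (2 * block_dim) == 0);
    std::size_t num_blocks = g_idata.size() / (2 * block_dim);
    std::vector<std::vector<int>> sdata(num_blocks,
                                        std::vector<int>(block_dim, 0));
    for (std::size_t block_idx = 0; block_idx < num_blocks; ++block_idx) {
        for (std::size_t tid = 0; tid < block_dim; ++tid) {
            std::size_t i = block_idx * (block_dim * 2) + tid;
            // First summand: even unit. Second summand: the odd unit after it.
            sdata[block_idx][tid] = g_idata[i] + g_idata[i + block_dim];
        }
    }
    return sdata;
}

// Sum of every partial sum across all blocks.
long long total(const std::vector<std::vector<int>>& sdata) {
    long long sum = 0;
    for (const auto& block : sdata) {
        for (int v : block) sum += v;
    }
    return sum;
}
```

With an input of 512 elements holding the values 0 to 511 and block_dim = 128, the first entry of the second block's array should be g_idata[256] + g_idata[384], and total should match the plain sum of the input.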
Example: if you run the non-optimized kernel with blockDim=256=work (single block, unit=128), then the optimized code has a single block of blockDim=128=unit. Since this block gets blockIdx=0, the *2 does not matter; the first thread does g_idata[0] + g_idata[0 + 128].
If you had 512 elements and ran the non-optimized code with 2 blocks of size 256 (work=256, unit=128), then the optimized code has 2 blocks, but now of size 128. The first thread in the second block (blockIdx=1) does g_idata[2*128] + g_idata[2*128+128].
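Both examples can be reproduced with a little host-side arithmetic. The sketch below is plain C++ with hypothetical names (blocks_for_two_loads, LoadPair, load_indices). It computes how many blocks are needed when each thread loads two elements, and which two positions a given thread reads. Rounding the block count up is an assumption of this sketch: if the element count were not a multiple of 2 * blockDim, a real kernel would also need bounds checks that the quoted snippet does not have.

```cpp
#include <cassert>
#include <cstddef>

// Number of blocks needed when every thread loads two elements,
// i.e. each block consumes 2 * block_dim input elements (rounded up).
std::size_t blocks_for_two_loads(std::size_t n, std::size_t block_dim) {
    assert(block_dim > 0);
    return (n + 2 * block_dim - 1) / (2 * block_dim);
}

// The two g_idata positions read by one thread in line 3.
struct LoadPair {
    std::size_t first;   // i
    std::size_t second;  // i + blockDim.x
};

LoadPair load_indices(std::size_t block_idx, std::size_t block_dim,
                      std::size_t tid) {
    std::size_t i = block_idx * (block_dim * 2) + tid;
    return LoadPair{i, i + block_dim};
}
```

For 256 elements with a block size of 128 this gives a single block, and for 512 elements it gives two. load_indices(1, 128, 0) yields positions 256 and 384, matching g_idata[2*128] + g_idata[2*128+128] above.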