How to reduce in CUDA if __syncthreads can't be called inside conditional branches?
The reduction method suggested by NVIDIA uses __syncthreads() inside conditional branching, e.g.:

if (blockSize >= 512) { if (tid < 256) { sdata[tid] += sdata[tid + 256]; } __syncthreads(); }
or
for (unsigned int s=blockDim.x/2; s>32; s>>=1)
{
if (tid < s)
sdata[tid] += sdata[tid + s];
__syncthreads();
}
In the second example, __syncthreads() is inside the for loop body, which is also a conditional branch.
However, a number of questions on SO raise the problem of __syncthreads() inside conditional branches (e.g. "Can I use __syncthreads() after having dropped threads?" and "conditional syncthreads & deadlock (or not)"), and the answers say that __syncthreads() in conditional branches may lead to a deadlock. Consequently, the reduction method suggested by NVIDIA may deadlock (if one believes the documentation on which those answers are based).
Furthermore, if __syncthreads() can't be used inside conditional branches, then I'm afraid that many basic operations are blocked, and reduction is just one example.
So how does one do reduction in CUDA without using __syncthreads() in conditional branches? Or is this a bug in the documentation?
The limitation is not that __syncthreads cannot be used in conditional branches.

The limitation is that __syncthreads cannot be used in branches which will not be traversed by all threads at the same time.
Notice that in both of the examples you give, __syncthreads is not covered by a condition that depends on the thread ID (or on some per-thread data). In the first case, blockSize is a template parameter which does not depend on the thread ID. In the second case, the call is likewise placed after the if.
Yes, the for loop's s > 32 is a condition, but it is a condition whose truth value does not depend on the thread or its data in any way. blockDim.x is the same for all threads, and all threads execute exactly the same modifications of s. This means that all threads reach the __syncthreads at exactly the same point in their control flow, which is perfectly OK.
The other case, where you cannot use __syncthreads, is a condition which can be true for some threads and false for others. In such a case, you have to close all such conditions before using __syncthreads. So instead of this:
if (threadIdx.x < SOME_CONSTANT)
{
operation1();
__syncthreads();
operation2();
}
You must do this:
if (threadIdx.x < SOME_CONSTANT)
{
operation1();
}
__syncthreads();
if (threadIdx.x < SOME_CONSTANT)
{
operation2();
}
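As a concrete instance of this transformation, consider a step where the active threads first populate shared memory and then read a neighbour's slot: the barrier has to sit between the two accesses, yet outside the branch. A hypothetical sketch (HALF and the kernel name are illustrative; the block is assumed to have at least 2 * HALF threads):

```cuda
#define HALF 128  // illustrative constant

__global__ void pairwiseStep(float *data)
{
    __shared__ float sdata[2 * HALF];
    unsigned int tid = threadIdx.x;

    if (tid < 2 * HALF)
        sdata[tid] = data[tid];       // "operation1": write shared memory
    __syncthreads();                  // barrier outside the branch: all
                                      // threads execute it, active or not
    if (tid < HALF)
        data[tid] = sdata[tid] + sdata[tid + HALF];  // "operation2": read
}
```

Had the __syncthreads stayed inside the first if, threads with tid >= 2 * HALF would never reach it, and the barrier's behavior would be undefined.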
Both of the examples you gave demonstrate this too: the thread-ID-dependent condition is closed before __syncthreads is called.