How to reduce in CUDA if __syncthreads can't be called inside conditional branches?

The reduction method suggested by NVIDIA uses __syncthreads() inside conditional branches, e.g.:

if (blockSize >= 512) { if (tid < 256) { sdata[tid] += sdata[tid + 256]; } __syncthreads(); }

or

for (unsigned int s=blockDim.x/2; s>32; s>>=1)
{
    if (tid < s)
        sdata[tid] += sdata[tid + s];
    __syncthreads();
}

In the second example, __syncthreads() is inside the for loop body, which is also a conditional branch.

However, a number of questions on SO raise the problem of __syncthreads() inside conditional branches (e.g. "Can I use __syncthreads() after having dropped threads?" and "conditional syncthreads & deadlock (or not)"), and the answers say that __syncthreads() in conditional branches may lead to a deadlock. Consequently, the reduction method suggested by NVIDIA may deadlock (if the documentation on which those answers are based is to be believed).

Furthermore, if __syncthreads() can't be used inside conditional branches, then I'm afraid that many basic operations are blocked, and reduction is just one example.

So how do I do a reduction in CUDA without using __syncthreads() in conditional branches? Or is it a bug in the documentation?

The limitation is not:

    __syncthreads cannot be used in conditional branches

The limitation is:

    __syncthreads cannot be used in branches which will not be traversed by all threads at the same time

Notice that in both of the examples you give, __syncthreads is not covered by a condition that depends on the thread ID (or on some per-thread data). In the first case, blockSize is a template parameter which does not depend on the thread ID, and the call sits after the inner if (tid < 256) block. In the second case, __syncthreads likewise comes after the if (tid < s) statement, not inside it.

Yes, the for loop's s > 32 is a condition, but it is a condition whose truth value does not depend on the thread or its data in any way. blockDim.x is the same for all threads, and all threads execute exactly the same modifications of s. This means that all threads reach the __syncthreads at exactly the same point in their control flow, which is perfectly OK.
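To make the point about the loop concrete, here is a minimal sketch of a complete block-level sum in which the only thread-dependent condition (tid < s) closes before the barrier. It is not NVIDIA's optimized kernel: the loop runs all the way down to s > 0 instead of switching to a warp-level tail at 32, the names blockSum and sumOnDevice are made up for illustration, blockDim.x is assumed to be a power of two, and error checking is omitted.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Each block sums blockDim.x input elements into out[blockIdx.x].
// Assumption: blockDim.x is a power of two.
__global__ void blockSum(const float* in, float* out, unsigned int n)
{
    extern __shared__ float sdata[];
    unsigned int tid = threadIdx.x;
    unsigned int i = blockIdx.x * blockDim.x + tid;

    // Every thread fills its slot; threads past the end contribute 0.
    sdata[tid] = (i < n) ? in[i] : 0.0f;
    __syncthreads();

    // The loop condition depends only on blockDim.x and s, which are
    // identical for all threads of the block.
    for (unsigned int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s)                       // thread-dependent: some threads skip the add
            sdata[tid] += sdata[tid + s];
        __syncthreads();                   // outside the if: every thread reaches it
    }

    if (tid == 0)
        out[blockIdx.x] = sdata[0];
}

// Hypothetical host helper: launches blockSum and adds the per-block
// partial sums on the CPU. Assumes a non-empty input.
float sumOnDevice(const std::vector<float>& h_in, int threads = 256)
{
    unsigned int n = static_cast<unsigned int>(h_in.size());
    int blocks = static_cast<int>((n + threads - 1) / threads);

    float* d_in = nullptr;
    float* d_out = nullptr;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_out, blocks * sizeof(float));
    cudaMemcpy(d_in, h_in.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    blockSum<<<blocks, threads, threads * sizeof(float)>>>(d_in, d_out, n);

    std::vector<float> partial(blocks);
    cudaMemcpy(partial.data(), d_out, blocks * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);

    float total = 0.0f;
    for (float p : partial) total += p;
    return total;
}

int main()
{
    std::vector<float> data(1000, 1.0f);   // 1000 ones
    printf("sum = %.1f\n", sumOnDevice(data));
    return 0;
}
```

With 1000 ones as input the result should be 1000. If the __syncthreads() were moved inside the if (tid < s) body, threads with tid >= s would skip it, which is exactly the situation the limitation forbids.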

The other case, where you cannot use __syncthreads, is a condition which can be true for some threads and false for others. In such a case, you have to close all such conditions before calling __syncthreads. So instead of this:

if (threadIdx.x < SOME_CONSTANT)
{
  operation1();
  __syncthreads();
  operation2();
}

You must do this:

if (threadIdx.x < SOME_CONSTANT)
{
  operation1();
}
__syncthreads();
if (threadIdx.x < SOME_CONSTANT)
{
  operation2();
}

Both of the examples you gave demonstrate this too: the thread-ID-dependent condition is closed before __syncthreads is called.
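The split-condition pattern can be tried end to end with a small sketch like the one below. The stand-ins for operation1 and operation2 are invented for illustration: the first `limit` threads stage their element in shared memory, and after the barrier the same threads read a slot written by a different thread, reversing the first `limit` elements. The names reverseFirst and reverseOnDevice are hypothetical, a single block is assumed (1 to 1024 elements, limit no larger than the element count), and error checking is omitted.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Reverses data[0..limit-1] in place; launched with a single block.
__global__ void reverseFirst(int* data, int limit)
{
    extern __shared__ int tmp[];
    int tid = threadIdx.x;

    if (tid < limit)                        // "operation1": stage into shared memory
        tmp[tid] = data[tid];

    __syncthreads();                        // closed condition: all threads arrive here

    if (tid < limit)                        // "operation2": read another thread's slot
        data[tid] = tmp[limit - 1 - tid];
}

// Hypothetical host helper. Assumes 1 <= v.size() <= 1024 and limit <= v.size().
std::vector<int> reverseOnDevice(std::vector<int> v, int limit)
{
    int n = static_cast<int>(v.size());
    int* d = nullptr;
    cudaMalloc(&d, n * sizeof(int));
    cudaMemcpy(d, v.data(), n * sizeof(int), cudaMemcpyHostToDevice);

    // One block of n threads, n ints of dynamic shared memory.
    reverseFirst<<<1, n, n * sizeof(int)>>>(d, limit);

    cudaMemcpy(v.data(), d, n * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d);
    return v;
}

int main()
{
    std::vector<int> r = reverseOnDevice({0, 1, 2, 3, 4, 5, 6, 7}, 5);
    for (int x : r) printf("%d ", x);
    printf("\n");
    return 0;
}
```

The barrier is needed because each thread below `limit` reads a slot that a different thread wrote. Keeping it between the two if blocks lets the threads at or above `limit` take part in the barrier even though they do no work.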
