How to reduce in CUDA if __syncthreads can't be called inside conditional branches?
The reduction method suggested by NVIDIA uses __syncthreads() inside conditional branching, e.g.:

if (blockSize >= 512) { if (tid < 256) { sdata[tid] += sdata[tid + 256]; } __syncthreads(); }
or
for (unsigned int s=blockDim.x/2; s>32; s>>=1)
{
if (tid < s)
sdata[tid] += sdata[tid + s];
__syncthreads();
}
In the second example, __syncthreads() is inside the for loop body, which is also a conditional branch.
However, a number of questions on SO raise the problem of __syncthreads() inside conditional branches (e.g. "Can I use __syncthreads() after having dropped threads?" and "conditional syncthreads & deadlock (or not)"), and the answers say that __syncthreads() in conditional branches may lead to a deadlock. Consequently, the reduction method suggested by NVIDIA may deadlock (if one believes the documentation on which those answers are based).
Furthermore, if __syncthreads() can't be used inside conditional branches, then I'm afraid that many basic operations are blocked, and reduction is just one example.
So how does one do reduction in CUDA without using __syncthreads() in conditional branches? Or is this a bug in the documentation?
The limitation is not that __syncthreads cannot be used in conditional branches.

The limitation is that __syncthreads cannot be used in branches which will not be traversed by all threads at the same time.
Notice that in both of the examples you give, __syncthreads is not covered by a condition that depends on the thread ID (or on some per-thread data). In the first case, blockSize is a template parameter which does not depend on the thread ID. In the second case, the call is likewise placed after the if.
Yes, the for loop's s > 32 is a condition, but it is a condition whose truth value does not depend on the thread or its data in any way. blockDim.x is the same for all threads, and all threads execute exactly the same modifications of s. This means that all threads reach the __syncthreads at exactly the same point in their control flow, which is perfectly OK.
The other case, where you cannot use __syncthreads, is a condition which can be true for some threads and false for others. In such a case, you have to close all such conditions before using __syncthreads. So instead of this:
if (threadIdx.x < SOME_CONSTANT)
{
operation1();
__syncthreads();
operation2();
}
You must do this:
if (threadIdx.x < SOME_CONSTANT)
{
operation1();
}
__syncthreads();
if (threadIdx.x < SOME_CONSTANT)
{
operation2();
}
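As a concrete instance of this transformation, consider a step where the active threads first populate shared memory and then read a neighbour's slot: the barrier has to sit between the two accesses, yet outside the branch. A hypothetical sketch (HALF and the kernel name are illustrative; the block is assumed to have at least 2 * HALF threads):

```cuda
#define HALF 128  // illustrative constant

__global__ void pairwiseStep(float *data)
{
    __shared__ float sdata[2 * HALF];
    unsigned int tid = threadIdx.x;

    if (tid < 2 * HALF)
        sdata[tid] = data[tid];       // "operation1": write shared memory
    __syncthreads();                  // barrier outside the branch: all
                                      // threads execute it, active or not
    if (tid < HALF)
        data[tid] = sdata[tid] + sdata[tid + HALF];  // "operation2": read
}
```

Had the __syncthreads stayed inside the first if, threads with tid >= 2 * HALF would never reach it, and the barrier's behavior would be undefined.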
Both of the examples you gave demonstrate this too: the thread-ID-dependent condition is closed before __syncthreads is called.