
Is branch divergence really so bad?

I've seen many questions scattered across the Internet about branch divergence, and how to avoid it. However, even after reading dozens of articles on how CUDA works, I can't seem to see how avoiding branch divergence helps in most cases. Before anyone jumps on me with claws outstretched, allow me to describe what I consider to be "most cases".

It seems to me that most instances of branch divergence involve a number of truly distinct blocks of code. For example, we have the following scenario:

if (A):
  foo(A)
else:
  bar(B)

If we have two threads that encounter this divergence, thread 1 will execute first, taking path A. Following this, thread 2 will take path B. In order to remove the divergence, we might change the block above to read like this:

foo(A)
bar(B)

Assuming it is safe to call foo(A) on thread 2 and bar(B) on thread 1, one might expect performance to improve. However, here's the way I see it:

In the first case, threads 1 and 2 execute in serial. Call this two clock cycles.

In the second case, threads 1 and 2 execute foo(A) in parallel, then execute bar(B) in parallel. This still looks to me like two clock cycles; the difference is that in the former case, if foo(A) involves a read from memory, I imagine thread 2 can begin execution during that latency, which results in latency hiding. If this is the case, the branch divergent code is faster.
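
To make the comparison concrete, here's a minimal sketch of the two cases as CUDA kernels; foo, bar, and the condition are hypothetical stand-ins for whatever the real work is, not anything prescribed above:

__device__ float foo(float a) { return a * 2.0f; }   // hypothetical work for path A
__device__ float bar(float b) { return b - 1.0f; }   // hypothetical work for path B

// Case 1: divergent. Threads in the same warp may take different paths,
// and the hardware serializes the two paths within that warp.
__global__ void divergent(const float *A, const float *B, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        if (A[i] > 0.0f)  out[i] = foo(A[i]);
        else              out[i] = bar(B[i]);
    }
}

// Case 2: no divergence. Every thread runs both foo and bar and keeps
// one result, trading the branch for extra (redundant) work.
__global__ void nonDivergent(const float *A, const float *B, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float fa = foo(A[i]);
        float fb = bar(B[i]);
        out[i] = (A[i] > 0.0f) ? fa : fb;
    }
}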

You're assuming (at least it's the example you give and the only reference you make) that the only way to avoid branch divergence is to allow all threads to execute all the code.

In that case I agree there's not much difference.

But avoiding branch divergence probably has more to do with algorithm re-structuring at a higher level than just the addition or removal of some if statements and making code "safe" to execute in all threads.

I'll offer up one example. Suppose I know that odd threads will need to handle the blue component of a pixel and even threads will need to handle the green component:

#define N 2 // number of pixel components
#define BLUE 0
#define GREEN 1
// pixel order: px0BL px0GR px1BL px1GR ...

// adjacent threads take different paths, so every warp diverges and the
// hardware serializes its foo threads and its bar threads
if (threadIdx.x & 1)  foo(pixel(N*threadIdx.x+BLUE));   // odd threads: blue
else                  bar(pixel(N*threadIdx.x+GREEN));  // even threads: green

This means that every alternate thread is taking a given path, whether it be foo or bar. So now my warp takes twice as long to execute.

However, if I rearrange my pixel data so that the color components are contiguous, perhaps in chunks of 32 pixels: BL0 BL1 BL2 ... GR0 GR1 GR2 ...

I can write similar code:

if (threadIdx.x & 32)  bar(pixel(threadIdx.x));   // threads 32..63, 96..127, ...: green
else                   foo(pixel(threadIdx.x));   // threads 0..31, 64..95, ...: blue

It still looks like I have the possibility for divergence. But since the divergence happens on warp boundaries, a given warp executes either the if path or the else path, so no actual divergence occurs.
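
Putting it together as a minimal, self-contained program (the foo/bar work, the buffer setup, and the launch size are all illustrative assumptions, not part of any real pipeline):

#include <cuda_runtime.h>

__device__ void foo(float *p) { *p *= 2.0f; }  // hypothetical blue-component work
__device__ void bar(float *p) { *p -= 1.0f; }  // hypothetical green-component work

// layout assumed: 32 blue components, then 32 green components, repeating
__global__ void shade(float *pixel)
{
    if (threadIdx.x & 32)  bar(&pixel[threadIdx.x]);  // odd-numbered warps: all green
    else                   foo(&pixel[threadIdx.x]);  // even-numbered warps: all blue
}

int main()
{
    float *d = nullptr;
    cudaMalloc(&d, 64 * sizeof(float));
    cudaMemset(d, 0, 64 * sizeof(float));
    shade<<<1, 64>>>(d);        // one block of two warps; each warp is uniform
    cudaDeviceSynchronize();
    cudaFree(d);
    return 0;
}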

This is a trivial example, and probably stupid, but it illustrates that there may be ways to work around warp divergence that don't involve running all the code of all the divergent paths.
