
Execution order of threads in OpenGL compute shader

I am wondering about the execution order of threads in OpenGL.

Say I have a mobile GPU, which often has between 8 and 32 cores (e.g. ARM Mali). That means they are different from Nvidia warps (AMD wavefronts).

The reason I am asking is the following toy example:

layout(local_size_x = 256, local_size_y = 1, local_size_z = 1) in;

shared float a[16];

void main() {
    uint tid = gl_GlobalInvocationID.x; // <-- thread id

    // set all a to 0
    if (tid < 16) {
        a[tid] = 0;
    }
    barrier();
    memoryBarrierShared();

    a[tid % 16] += 1;

    barrier();
    memoryBarrierShared();

    float b = 0;
    b = REDUCE(a); // <-- reduction of array a
}

  • It happens that b is different from execution to execution ( glDispatchCompute(1, 100, 1) ), as if there is some race condition.

  • I am not sure whether threads within a work group are really concurrent (like warps in a streaming multiprocessor).

  • Also, how many cores are mapped to work groups/shaders?

  • What are your thoughts about that? Thanks

It happens that b is different from execution to execution ( glDispatchCompute(1, 100, 1) ), as if there is some race condition.

That's because there is one:

a[tid % 16] += 1;

For a workgroup with a local size of 256, each value of tid % 16 is shared by 16 invocations in that workgroup (256 / 16). Therefore, those invocations will all attempt to manipulate the same index of a.

Since there are no barriers or any other mechanism to prevent this, this is a race condition on the elements of a. Therefore, you get undefined behavior.
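To make the race concrete, here is a minimal sketch that spells out the read, modify and write steps that a[tid % 16] += 1 conceptually consists of. The #version line, the local names slot, old and updated, and the use of gl_LocalInvocationID are assumptions made for illustration; they are not part of the question's code.

```glsl
#version 430  // assumed version; the original snippet does not state one
layout(local_size_x = 256, local_size_y = 1, local_size_z = 1) in;

shared float a[16];

void main() {
    uint tid  = gl_LocalInvocationID.x; // index within the work group
    uint slot = tid % 16u;              // 16 invocations map to each slot

    if (tid < 16u) {
        a[tid] = 0.0;
    }
    memoryBarrierShared();
    barrier();

    // "a[slot] += 1" is not a single indivisible step. Conceptually:
    float old     = a[slot];   // 1. read the current value
    float updated = old + 1.0; // 2. compute the new value
    a[slot]       = updated;   // 3. write it back

    // If two invocations sharing a slot both perform step 1 before either
    // performs step 3, both write the same value and one increment is lost.
}
```

Nothing orders steps 1 to 3 between the invocations that share a slot, so the final contents of a can change from run to run.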

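Now, you could manipulate a through atomic operations (core GLSL defines atomicAdd only for int and uint, so a would have to be declared as an int array rather than float):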

atomicAdd(a[tid % 16], 1);

That is well-defined behavior.
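Putting this together, a corrected version of the toy shader might look like the sketch below. It makes several assumptions that are not in the original: the #version line, a hypothetical output buffer Result at binding 0 that the application would have to create with at least one uint per work group, a plain loop standing in for REDUCE, and unsigned counters (hence the 1u literal) so that atomicAdd is available in core GLSL.

```glsl
#version 430  // assumed version; the original snippet does not state one
layout(local_size_x = 256, local_size_y = 1, local_size_z = 1) in;

// Hypothetical output buffer: one result per work group.
layout(std430, binding = 0) buffer Result {
    uint b[];
};

// Integer counters, because core atomicAdd accepts only int and uint.
shared uint a[16];

void main() {
    // Shared memory is per work group, so index it with the local ID.
    uint tid = gl_LocalInvocationID.x;

    // Set all of a to 0.
    if (tid < 16u) {
        a[tid] = 0u;
    }
    memoryBarrierShared();
    barrier();

    // Each increment is now an indivisible read-modify-write.
    atomicAdd(a[tid % 16u], 1u);

    memoryBarrierShared();
    barrier();

    // One invocation per work group reduces the array and stores the sum.
    if (tid == 0u) {
        uint sum = 0u;
        for (uint i = 0u; i < 16u; ++i) {
            sum += a[i];
        }
        // Flattened work group index; works for glDispatchCompute(1, 100, 1).
        uint group = gl_WorkGroupID.y * gl_NumWorkGroups.x + gl_WorkGroupID.x;
        b[group] = sum;
    }
}
```

Each of the 256 invocations adds exactly 1, so under these assumptions every work group should store 256, regardless of the order in which the invocations run.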


I am not sure whether threads within a work group are really concurrent (like warps in a streaming multiprocessor).

This is irrelevant. You must treat them as if they are executed concurrently.

Also, how many cores are mapped to work groups/shaders?

Again, essentially irrelevant. This matters in terms of performance, but that's mainly about how big to make your local group size. In terms of whether your code works or not, it doesn't matter.

