
Execution order of threads in OpenGL compute shader

I am wondering about the execution order of threads in OpenGL.

Say I have a mobile GPU, which often has between 8 and 32 cores (e.g. ARM Mali). That means it is organized differently from Nvidia warps or AMD wavefronts.

The reason I am asking is the following toy example:

layout(local_size_x = 256, local_size_y = 1, local_size_z = 1) in;

shared float a[16];

void main() {
    uint tid = gl_GlobalInvocationID.x; // <-- thread id

    // set all a to 0
    if (tid < 16) {
        a[tid] = 0;
    }
    barrier();
    memoryBarrierShared();

    a[tid % 16] += 1;

    barrier();
    memoryBarrierShared();

    float b = 0;
    b = REDUCE(a); // <-- reduction of a array a
}

  • It happens that b is different from execution to execution ( glDispatchCompute(1, 100, 1) ) as if there is some race condition.

  • I am not sure whether threads within a work group are really concurrent (like warps in a streaming multiprocessor).

  • Also how many cores are mapped to work groups/shaders?

  • What are your thoughts about that? Thanks

It happens that b is different from execution to execution ( glDispatchCompute(1, 100, 1) ) as if there is some race condition.

That's because there is one:

a[tid % 16] += 1;

For a workgroup with a local size of 256, tid % 16 takes only 16 distinct values, so 16 invocations in that workgroup share each value. Therefore, those invocations will attempt to manipulate the same index of a .

Nothing orders these accesses: the barriers before and after the statement do not serialize the read-modify-write that each += performs. This is a race condition on the elements of a , and therefore you get undefined behavior.

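Now, you could manipulate a through atomic operations . Core GLSL atomicAdd only accepts int or uint operands, so a would have to be declared as an integer array (for example shared uint a[16];) rather than float:

atomicAdd(a[tid % 16], 1u);

That is well-defined behavior.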
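Putting the atomic fix into the whole shader might look like the sketch below. It assumes desktop GLSL 4.30 (on OpenGL ES 3.1 the version line would be #version 310 es), switches a to uint so that core atomicAdd applies, and writes the result to a hypothetical storage buffer named Result that is not in the question; the host would have to bind a buffer with at least 100 entries for glDispatchCompute(1, 100, 1). The loop stands in for the question's REDUCE placeholder.

```glsl
#version 430

layout(local_size_x = 256, local_size_y = 1, local_size_z = 1) in;

// Hypothetical output buffer, added only so the result can be inspected.
layout(std430, binding = 0) buffer Result {
    uint total[];   // one entry per workgroup
};

shared uint a[16];  // integer type, so core atomicAdd can be used

void main() {
    // Index within the workgroup; shared memory is per workgroup.
    uint lid = gl_LocalInvocationIndex;

    // Set all of a to 0 (one invocation per element).
    if (lid < 16u) {
        a[lid] = 0u;
    }
    memoryBarrierShared();
    barrier();

    // 16 invocations target each element; the atomic makes that safe.
    atomicAdd(a[lid % 16u], 1u);

    memoryBarrierShared();
    barrier();

    // Simple serial reduction by one invocation (stand-in for REDUCE).
    if (lid == 0u) {
        uint b = 0u;
        for (int i = 0; i < 16; ++i) {
            b += a[i];
        }
        // Workgroups differ in y for glDispatchCompute(1, 100, 1).
        total[gl_WorkGroupID.y] = b;
    }
}
```

With this version every workgroup should report the same sum on every run, since each of the 256 invocations contributes exactly one increment.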


I am not sure whether threads within a work group are really concurrent (like warps in a streaming multiprocessor).

This is irrelevant. You must treat them as if they are executed concurrently.
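One way to write code that holds up under that assumption is to make sure no two invocations ever write the same location between barriers. The sketch below is an illustration, not the question's method: it assumes desktop GLSL 4.30, an integer a, and a hypothetical scratch array contrib with one element per invocation.

```glsl
#version 430

layout(local_size_x = 256, local_size_y = 1, local_size_z = 1) in;

const uint SLOTS = 16u;

shared uint contrib[256];  // hypothetical scratch: one element per invocation
shared uint a[16];

void main() {
    uint lid = gl_LocalInvocationIndex;

    // Each invocation writes only its own element, so no writes collide.
    contrib[lid] = 1u;

    memoryBarrierShared();
    barrier();

    // Invocation s (s < 16) is the sole writer of a[s]; it sums the
    // contributions of every invocation whose index maps to slot s.
    if (lid < SLOTS) {
        uint sum = 0u;
        for (uint i = lid; i < 256u; i += SLOTS) {
            sum += contrib[i];
        }
        a[lid] = sum;
    }

    memoryBarrierShared();
    barrier();

    // a[] can now be read by any invocation in the workgroup.
}
```

This trades the atomics for extra shared memory and a short loop; whether that is faster depends on the hardware, but it is correct regardless of how the invocations are scheduled.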

Also how many cores are mapped to work groups/shaders?

Again, essentially irrelevant. This matters in terms of performance, but that's mainly about how big to make your local group size. But in terms of whether your code works or not, it doesn't matter.
