Execution order of threads in OpenGL compute shader

Question

I am wondering about the execution order of threads in OpenGL.

Say I have a mobile GPU, which often has between 8 and 32 cores (e.g. ARM Mali). That means it is organized differently from Nvidia warps (or AMD wavefronts).

The reason I am asking is the following toy example:

layout(local_size_x = 256, local_size_y = 1, local_size_z = 1) in;

shared float a[16];

void main() {
    uint tid = gl_GlobalInvocationID.x; // <-- thread id

    // set all a to 0
    if (tid < 16) {
        a[tid] = 0;
    }
    barrier();
    memoryBarrierShared();

    a[tid % 16] += 1;

    barrier();
    memoryBarrierShared();

    float b = 0;
    b = REDUCE(a); // <-- reduction of a array a
}

It happens that b differs from execution to execution (glDispatchCompute(1, 100, 1)), as if there were some race condition.
I am not sure whether threads within a work group are really concurrent (like warps in a streaming multiprocessor).
Also, how many cores are mapped to work groups/shaders?
What are your thoughts about that? Thanks

Answer 1

It happens that b differs from execution to execution (glDispatchCompute(1, 100, 1)), as if there were some race condition.

That's because there is one:

a[tid % 16] += 1;

For a workgroup with a local size of 256, there will be several invocations in that workgroup (16 of them, since 256 / 16 = 16) that have the same value of tid % 16. Therefore, those invocations will attempt to modify the same element of a.

Nothing orders these read-modify-write operations relative to one another, so this is a race condition on the elements of a, and you get undefined behavior.

Now, you could manipulate a through atomic operations:

atomicAdd(a[tid % 16], 1);

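That is well-defined behavior. Note that atomicAdd in core GLSL operates only on int and uint, so a would have to be declared as an integer array rather than a float array.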
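
Putting that together, here is a minimal sketch of the toy shader with the race removed. It is an illustration under a few assumptions, not the asker's actual code: the Result buffer and its binding are hypothetical (added only so there is somewhere to write b), the serial loop is a stand-in for the unspecified REDUCE, and the counters are uint so that atomicAdd applies.

```glsl
#version 430

layout(local_size_x = 256, local_size_y = 1, local_size_z = 1) in;

// Hypothetical output buffer (not in the question): one value per work group.
layout(std430, binding = 0) buffer Result {
    uint result[];
};

// uint rather than float, so that core atomicAdd can be used on it.
shared uint a[16];

void main() {
    // Index within the work group; shared memory is private to each work group.
    uint lid = gl_LocalInvocationIndex;

    // Set all of a to 0.
    if (lid < 16u) {
        a[lid] = 0u;
    }
    memoryBarrierShared();
    barrier();

    // 16 invocations target each slot; the atomic makes every increment count.
    atomicAdd(a[lid % 16u], 1u);

    memoryBarrierShared();
    barrier();

    // Serial stand-in for REDUCE(a): one invocation sums the 16 slots.
    if (lid == 0u) {
        uint b = 0u;
        for (uint i = 0u; i < 16u; ++i) {
            b += a[i];
        }
        // Flattened work-group index for a dispatch such as (1, 100, 1).
        uint group = gl_WorkGroupID.y * gl_NumWorkGroups.x + gl_WorkGroupID.x;
        result[group] = b;
    }
}
```

With 256 invocations each adding 1, every work group should write 256, and that value should no longer change between runs.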

I am not sure whether threads within a work group are really concurrent (like warps in a streaming multiprocessor).

This is irrelevant. You must treat them as if they are executed concurrently.
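
One way to write code that is safe under that assumption, without atomics, is to make sure no two invocations ever write the same element between barriers. The sketch below is only an illustration of that idea, not something from the question: the partial array and the Result buffer are hypothetical names, and the final serial loop again stands in for REDUCE. It keeps a as a float array.

```glsl
#version 430

layout(local_size_x = 256, local_size_y = 1, local_size_z = 1) in;

// Hypothetical output buffer (not in the question): one value per work group.
layout(std430, binding = 0) buffer Result {
    float result[];
};

shared float partial[256]; // hypothetical: one private slot per invocation
shared float a[16];

void main() {
    uint lid = gl_LocalInvocationIndex;

    // Each invocation writes only its own slot, so there is no conflict.
    partial[lid] = 1.0;

    memoryBarrierShared();
    barrier();

    // Invocation i (i < 16) alone owns a[i] and gathers every slot k with k % 16 == i.
    if (lid < 16u) {
        float sum = 0.0;
        for (uint k = lid; k < 256u; k += 16u) {
            sum += partial[k];
        }
        a[lid] = sum;
    }

    memoryBarrierShared();
    barrier();

    // Serial stand-in for REDUCE(a).
    if (lid == 0u) {
        float b = 0.0;
        for (uint i = 0u; i < 16u; ++i) {
            b += a[i];
        }
        uint group = gl_WorkGroupID.y * gl_NumWorkGroups.x + gl_WorkGroupID.x;
        result[group] = b;
    }
}
```

The trade-off is extra shared memory (256 floats instead of 16) in exchange for not needing integer counters or atomics.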

Also, how many cores are mapped to work groups/shaders?

Again, essentially irrelevant. It matters for performance, mainly when deciding how big to make your local group size, but it has no bearing on whether your code works.

solution1
3 ACCEPTED 2017-01-16 01:15:53
