Compute shaders - how to globally synchronize threads?

Question

EDIT: I've rephrased the question to make it more general and simplified the code.

I'm probably missing something with thread synchronization in compute shaders. I have a simple compute shader that does parallel reduction on some numbers and then I need to modify the final sum:

#version 430 core
#define SIZE 256
#define CLUSTERS 5

layout(local_size_x = 16, local_size_y = 16, local_size_z = 1) in;

struct Cluster {
    vec3 cntr;
    uint size;
};
coherent restrict layout(std430, binding = 0) buffer destBuffer {
    Cluster clusters[CLUSTERS];
};
shared uint sizeCache[SIZE];

void main() {
    const ivec2 pos = ivec2(gl_GlobalInvocationID.xy);
    // Row-major linear index: the row width is local size times number of work-groups in x.
    const uint id = pos.y * (gl_WorkGroupSize.x * gl_NumWorkGroups.x) + pos.x;

    if(id < CLUSTERS) {
        clusters[id].size = 0;
    }

    sizeCache[gl_LocalInvocationIndex] = 1;
    // Make every invocation's shared-memory write visible before the reduction reads it.
    memoryBarrierShared();
    barrier();
    int stepv = (SIZE >> 1); 
    while(stepv > 0) { //reduction over data in each working group
        if (gl_LocalInvocationIndex < stepv) {
            sizeCache[gl_LocalInvocationIndex] += sizeCache[gl_LocalInvocationIndex + stepv];
        }
        memoryBarrierShared();
        barrier();
        stepv = (stepv >> 1);
    }
    if (gl_LocalInvocationIndex == 0) {
        atomicAdd(clusters[0].size, sizeCache[0]);
    }

    memoryBarrier();
    barrier();

    if(id == 0) {
        clusters[0].size = 23; //this doesn't do what I would expect
        clusters[1].size = 13; //this works
    }
}

The reduction works and produces the correct result. If I comment out the last condition, the value in clusters[0].size is 262144, which is correct (it is the number of threads). If I leave it in, I would expect to get the value 23, because as I understand it, the threads after barrier() should be synchronized, and after memoryBarrier() all previous changes in memory should be visible. However, it doesn't work: it produces a result like 259095. I guess that the value 23 is modified afterwards by an atomicAdd from another thread that I expected to have finished already, but I don't understand why.

This is how I read the result on CPU:

glBindBufferBase(GL_SHADER_STORAGE_BUFFER, 0, resultBuffer);

//currently it dispatches 262144 threads
glDispatchCompute(32, 32, 1);
glCheckError();

glMemoryBarrier(GL_ALL_BARRIER_BITS); //for debug

struct Cl {
    glm::vec3 cntr;
    uint size;
};

glBindBufferBase(GL_SHADER_STORAGE_BUFFER, 0, resultBuffer);

std::vector<Cl> data(5);
glGetBufferSubData(GL_SHADER_STORAGE_BUFFER, 0, sizeOfresult, &data[0]);

I have an NVIDIA GT630M card and Linux with the NVIDIA proprietary driver (331.49).

Answer 1

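You can't synchronize threads globally, i.e. across work-groups. This is pointed out in the comments by GuyRT. barrier() only synchronizes the invocations within a single work-group, and memoryBarrier() only orders the memory accesses of the invocation that calls it; neither makes one work-group wait for the others. In your code, one work-group can hit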

clusters[0].size = 23;

while another work-group is still happily doing atomic increments. Only the first thread of the first work-group enters the if(id == 0) block, and most GPUs dispatch work-groups in order, so the value 23 is written once, early on, and is then incremented many times by most of the other work-groups. That matches the number you see: 259095 = 23 + 1012 * 256, i.e. 1012 of the 1024 work-groups added their 256 after the write.
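
If the final sum has to be modified inside the same dispatch, one common workaround is to let whichever work-group finishes last do it. The following is only a sketch under assumptions, not a tested drop-in: the counterBuffer block at binding 1 and its groupsDone member are hypothetical additions that are not in the question, and the host is assumed to zero both groupsDone and clusters[0].size before the dispatch (zeroing them inside the shader has the same cross-work-group race).

```glsl
#version 430 core
#define SIZE 256
#define CLUSTERS 5

layout(local_size_x = 16, local_size_y = 16, local_size_z = 1) in;

struct Cluster {
    vec3 cntr;
    uint size;
};
coherent restrict layout(std430, binding = 0) buffer destBuffer {
    Cluster clusters[CLUSTERS];
};
// Hypothetical extra buffer (assumption): counts finished work-groups.
// Must be zeroed by the host before the dispatch.
coherent restrict layout(std430, binding = 1) buffer counterBuffer {
    uint groupsDone;
};
shared uint sizeCache[SIZE];

void main() {
    sizeCache[gl_LocalInvocationIndex] = 1u;
    memoryBarrierShared();
    barrier();

    // Reduction over the data of this work-group only.
    for (uint stepv = uint(SIZE) >> 1u; stepv > 0u; stepv >>= 1u) {
        if (gl_LocalInvocationIndex < stepv) {
            sizeCache[gl_LocalInvocationIndex] += sizeCache[gl_LocalInvocationIndex + stepv];
        }
        memoryBarrierShared();
        barrier();
    }

    if (gl_LocalInvocationIndex == 0u) {
        // Publish this work-group's partial sum.
        atomicAdd(clusters[0].size, sizeCache[0]);

        // Order the add above before the ticket increment below.
        memoryBarrierBuffer();

        uint total = gl_NumWorkGroups.x * gl_NumWorkGroups.y * gl_NumWorkGroups.z;
        uint ticket = atomicAdd(groupsDone, 1u); // returns the value before the add

        // Only the work-group that takes the last ticket gets here,
        // so every other work-group has already added its partial sum.
        if (ticket == total - 1u) {
            clusters[0].size = 23u;
            groupsDone = 0u; // reset so the next dispatch starts from zero
        }
    }
}
```

The simpler and more robust alternative is to split the work into two dispatches: run the reduction, call glMemoryBarrier(GL_SHADER_STORAGE_BARRIER_BIT), and then dispatch a second, tiny shader with glDispatchCompute(1, 1, 1) that modifies the sum. The end of a dispatch followed by that barrier is the only point at which all work-groups are guaranteed to be done.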
