
Compute shaders - how to globally synchronize threads?

EDIT: I've rephrased the question to make it more general and simplified the code.

I'm probably missing something about thread synchronization in compute shaders. I have a simple compute shader that does a parallel reduction on some numbers, and then I need to modify the final sum:

#version 430 core
#define SIZE 256
#define CLUSTERS 5

layout(local_size_x = 16, local_size_y = 16, local_size_z = 1) in;

struct Cluster {
    vec3 cntr;
    uint size;
};
coherent restrict layout(std430, binding = 0) buffer destBuffer {
    Cluster clusters[CLUSTERS];
};
shared uint sizeCache[SIZE];

void main() {
    const ivec2 pos = ivec2(gl_GlobalInvocationID.xy);
    const uint id = pos.y * (gl_WorkGroupSize.x * gl_NumWorkGroups.x) + pos.x; //global linear index; the row width is the total dispatch width

    if(id < CLUSTERS) {
        clusters[id].size = 0;
    }

    memoryBarrierShared();
    barrier();
    sizeCache[gl_LocalInvocationIndex] = 1;
    int stepv = (SIZE >> 1); 
    while(stepv > 0) { //reduction over data in each working group
        if (gl_LocalInvocationIndex < stepv) {
            sizeCache[gl_LocalInvocationIndex] += sizeCache[gl_LocalInvocationIndex + stepv];
        }
        memoryBarrierShared();
        barrier();
        stepv = (stepv >> 1);
    }
    if (gl_LocalInvocationIndex == 0) {
        atomicAdd(clusters[0].size, sizeCache[0]);
    }

    memoryBarrier();
    barrier();

    if(id == 0) {
        clusters[0].size = 23; //this doesn't do what I would expect
        clusters[1].size = 13; //this works
    }
}

The reduction works and produces the correct result. If I comment out the last condition, the value in clusters[0].size is 262144, which is correct (it is the number of threads). If I uncomment it, I would expect to get the value 23, because as I understand it the threads should be synchronized after barrier(), and after memoryBarrier() all previous writes to memory should be visible. However, it doesn't work: it produces a result like 259095. I guess the value 23 gets overwritten by an atomicAdd from another thread, but I don't understand why.

This is how I read the result on the CPU:

glBindBufferBase(GL_SHADER_STORAGE_BUFFER, 0, resultBuffer);

//currently it dispatches 262144 threads
glDispatchCompute(32, 32, 1);
glCheckError();

glMemoryBarrier(GL_ALL_BARRIER_BITS); //for debug

struct Cl {
    glm::vec3 cntr;
    uint size;
};

glBindBufferBase(GL_SHADER_STORAGE_BUFFER, 0, resultBuffer);

std::vector<Cl> data(5);
glGetBufferSubData(GL_SHADER_STORAGE_BUFFER, 0, sizeOfresult, &data[0]);

I have an NVIDIA GT630M card and Linux with the NVIDIA proprietary driver (331.49).

You can't synchronize threads globally, i.e. across work-groups. This is pointed out in the comments by GuyRT. In your code, one work-group can hit

clusters[0].size = 23;

while another work-group is happily doing atomic increments. Since only the first thread of the first work-group enters the if(id == 0) block, and since most GPUs dispatch work-groups in order, the value will be written once and then incremented many times by the other (i.e. most) work-groups.
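Because barriers inside a compute shader never span work-groups, the usual fix is to split the work into two dispatches and put a glMemoryBarrier between them on the host, so the final write only runs after every work-group has finished its atomicAdd. Below is a minimal sketch of that approach; the program handles reduceProgram and finalizeProgram are hypothetical names for your reduction shader and a tiny second-pass shader compiled separately.

// Second pass, run by a single invocation (hypothetical finalize shader):
//
//   #version 430 core
//   layout(local_size_x = 1, local_size_y = 1, local_size_z = 1) in;
//   struct Cluster { vec3 cntr; uint size; };
//   layout(std430, binding = 0) buffer destBuffer { Cluster clusters[5]; };
//   void main() {
//       clusters[0].size = 23; //safe now: every work-group has finished its atomicAdd
//   }

glUseProgram(reduceProgram);   //first pass: per-work-group reduction + atomicAdd
glBindBufferBase(GL_SHADER_STORAGE_BUFFER, 0, resultBuffer);
glDispatchCompute(32, 32, 1);

//make the SSBO writes of the first dispatch visible to the second dispatch
glMemoryBarrier(GL_SHADER_STORAGE_BARRIER_BIT);

glUseProgram(finalizeProgram); //second pass: a single thread patches the final sum
glDispatchCompute(1, 1, 1);

If the value written in the second pass is already known on the CPU (as the constant 23 is here), an alternative is to skip the second dispatch entirely and overwrite that field with glBufferSubData after a glMemoryBarrier(GL_BUFFER_UPDATE_BARRIER_BIT).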
