Compute shaders - how to globally synchronize threads?
EDIT: I've rephrased the question to make it more general and simplified the code.
I'm probably missing something about thread synchronization in compute shaders. I have a simple compute shader that does a parallel reduction on some numbers, after which I need to modify the final sum:
#version 430 core
#define SIZE 256
#define CLUSTERS 5

layout(local_size_x = 16, local_size_y = 16, local_size_z = 1) in;

struct Cluster {
    vec3 cntr;
    uint size;
};

coherent restrict layout(std430, binding = 0) buffer destBuffer {
    Cluster clusters[CLUSTERS];
};

shared uint sizeCache[SIZE];

void main() {
    const ivec2 pos = ivec2(gl_GlobalInvocationID.xy);
    const uint id = pos.y * (gl_WorkGroupSize.x * gl_NumWorkGroups.x) + pos.x;
    if (id < CLUSTERS) {
        clusters[id].size = 0;
    }
    memoryBarrierShared();
    barrier();

    sizeCache[gl_LocalInvocationIndex] = 1;
    int stepv = (SIZE >> 1);
    while (stepv > 0) { // reduction over data in each work-group
        if (gl_LocalInvocationIndex < stepv) {
            sizeCache[gl_LocalInvocationIndex] += sizeCache[gl_LocalInvocationIndex + stepv];
        }
        memoryBarrierShared();
        barrier();
        stepv = (stepv >> 1);
    }
    if (gl_LocalInvocationIndex == 0) {
        atomicAdd(clusters[0].size, sizeCache[0]);
    }
    memoryBarrier();
    barrier();

    if (id == 0) {
        clusters[0].size = 23; // this doesn't do what I would expect
        clusters[1].size = 13; // this works
    }
}
The reduction works and produces the correct result. If I comment out the last condition, the value in
clusters[0].size
is 262144, which is correct (it is the number of threads). If I uncomment it, I would expect to get the value 23, because as I understand it, the threads should be synchronized after barrier()
and after memoryBarrier()
all previous changes to memory should be visible. However, it doesn't work; it produces a result like 259095. I guess the value 23 is overwritten by an atomicAdd
from another thread, but I don't understand why.
This is how I read the result on the CPU:
glBindBufferBase(GL_SHADER_STORAGE_BUFFER, 0, resultBuffer);
//currently it dispatches 262144 threads
glDispatchCompute(32, 32, 1);
glCheckError();
glMemoryBarrier(GL_ALL_BARRIER_BITS); //for debug
struct Cl {
    glm::vec3 cntr;
    uint size;
};
glBindBufferBase(GL_SHADER_STORAGE_BUFFER, 0, resultBuffer);
std::vector<Cl> data(5);
glGetBufferSubData(GL_SHADER_STORAGE_BUFFER, 0, sizeOfresult, &data[0]);
I have an NVIDIA GT630M card and Linux with the NVIDIA proprietary driver (331.49).
You can't synchronize threads globally, i.e. across work-groups. This is pointed out in the comments by GuyRT. In your code, one work-group can hit
clusters[0].size = 23;
while another work-group is still happily doing atomic increments. Since only the first thread of the first work-group enters the
if(id==0)
block, and since most GPUs dispatch work-groups in order, the value will be written once and then incremented many more times by the remaining work-groups.
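A common workaround (a sketch I'm adding here, not part of the original answer) is to split the work into two dispatches: the first does the per-group reductions and atomic adds, and a second, tiny dispatch performs the final modification once every work-group of the first pass has finished. On the host side, a glMemoryBarrier(GL_SHADER_STORAGE_BARRIER_BIT) between the two glDispatchCompute calls makes the first pass's SSBO writes visible to the second. A hypothetical second shader could look like this:

```glsl
// Hypothetical finalization pass, dispatched with glDispatchCompute(1, 1, 1)
// *after* the reduction shader has completed. By then no other invocation
// can race with these writes.
#version 430 core
#define CLUSTERS 5

layout(local_size_x = 1) in;

struct Cluster {
    vec3 cntr;
    uint size;
};

layout(std430, binding = 0) buffer destBuffer {
    Cluster clusters[CLUSTERS];
};

void main() {
    // Safe: all atomicAdds from the first dispatch are done and visible.
    clusters[0].size = 23;
    clusters[1].size = 13;
}
```

An alternative single-dispatch pattern is to use an atomic counter to detect the last work-group to finish and let only that group perform the final write, but the two-dispatch version above is simpler and easier to get right.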