Optimize branched for loop in OpenCL
I have kernel code something like this:
__kernel fn(***){
//x, y are the image coordinates
int x = get_global_id(0);
int y = get_global_id(1);
//Initialize pixel value
int c = -5 + x * dx;
int d = -5 + y * dy;
int k=0;
for(; k< 500; k++){
//Perform Some Calculations using c and d
//Most of the calculations happen here
if(val > threshold)
break;
}
//Write data based on k
out[y*width+x] = k;
}
I have a feeling that, because most of the calculations happen inside the for loop and the loop creates a branch, some work items in a work group finish their kernel execution early and then wait for the entire work group to complete.
How can this be optimized if the output is based on the execution counter k?
The for loop will still contain a branch even if you remove this check:
if(val > threshold)
break;
The compiler generates that branch to test whether the loop should continue. We can, however, remove the extra branch created inside the loop body:
k += (val > threshold) * 500;
This increases k by 500 if val > threshold, so the loop exits through the same branch that checks whether k has reached its limit, with no extra branch. Note that k then no longer holds the iteration count directly: after the loop's own k++ it equals the exit iteration plus 501, so subtract 501 whenever k > 500 before writing the output. Depending on how heavy the calculation inside the loop is, this may not matter.
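To check that the branchless exit still yields the same counter as the original break, here is a minimal sketch in plain C that emulates a single work item on the host. The per-iteration calculation is not shown in the question, so step() below is a hypothetical stand-in (a Mandelbrot-style update), and the names MAX_ITER, count_with_break and count_branchless are made up for illustration.

```c
#define MAX_ITER 500

/* Hypothetical stand-in for "some calculations using c and d";
   the question does not show the real computation. */
static float step(float *zr, float *zi, float c, float d)
{
    float nr = *zr * *zr - *zi * *zi + c;
    float ni = 2.0f * *zr * *zi + d;
    *zr = nr;
    *zi = ni;
    return nr * nr + ni * ni;   /* plays the role of val */
}

/* Reference version: early exit with an explicit break, as in the question. */
int count_with_break(float c, float d, float threshold)
{
    float zr = 0.0f, zi = 0.0f;
    int k = 0;
    for (; k < MAX_ITER; k++) {
        float val = step(&zr, &zi, c, d);
        if (val > threshold)
            break;
    }
    return k;
}

/* Branchless exit: the comparison yields 0 or 1, so k jumps past
   MAX_ITER and the loop's own condition ends the loop. */
int count_branchless(float c, float d, float threshold)
{
    float zr = 0.0f, zi = 0.0f;
    int k = 0;
    for (; k < MAX_ITER; k++) {
        float val = step(&zr, &zi, c, d);
        k += (val > threshold) * MAX_ITER;
    }
    /* If the exit fired at iteration k0, k is now k0 + MAX_ITER + 1
       (the jump plus the loop's k++); undo that to recover k0. */
    return (k > MAX_ITER) ? k - (MAX_ITER + 1) : k;
}
```

Both functions should return the same value for any input, so the kernel could write the recovered counter to out exactly as before. Whether this is actually faster on a given device is something only profiling can tell: work items in the same work group that exit at different iterations still diverge at the loop condition.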