
How to implement summation using parallel reduction in OpenCL?

I'm trying to implement a kernel that performs a parallel reduction. The code below works only some of the time, and I have not been able to pin down why it goes wrong when it does.

__kernel void summation(__global float* input, __global float* partialSum, __local float *localSum){
    int local_id = get_local_id(0);
    int workgroup_size = get_local_size(0);
    localSum[local_id] = input[get_global_id(0)];

    for(int step = workgroup_size/2; step>0; step/=2){
        barrier(CLK_LOCAL_MEM_FENCE);

        if(local_id < step){
            localSum[local_id] += localSum[local_id + step];
        }
    }
    if(local_id == 0){
        partialSum[get_group_id(0)] = localSum[0];
    }
}

Essentially, I'm summing the values within each work-group and storing each work-group's total in partialSum; the final summation is done on the host. Below is the code that sets up the values for the summation.
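The host-side step is not shown in the post. A minimal sketch of it, assuming the partialSum buffer has already been read back from the device (for example with clEnqueueReadBuffer) and holds one total per work-group. The helper name sum_partials is hypothetical, and the double accumulator is an assumption: the total for 15000 inputs of this form is 112,507,500, which is above 2^24, the point beyond which a 32-bit float can no longer represent every integer.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not from the original post): adds up the
 * per-work-group totals that were read back from the device.
 * The double accumulator is an assumption made to limit rounding
 * error in the final sum. */
double sum_partials(const float *partial, size_t num_groups)
{
    assert(partial != NULL || num_groups == 0);

    double total = 0.0;
    for (size_t g = 0; g < num_groups; ++g) {
        total += partial[g];   /* each float is widened to double */
    }
    return total;
}
```

With the setup in the question, it would be called as sum_partials(partialSum, global[0] / local[0]).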

size_t global[1];
size_t local[1];

const int DATA_SIZE = 15000;
float *input = NULL;
float *partialSum = NULL;
int count = DATA_SIZE;

local[0] = 2;
global[0] = count;
input = (float *)malloc(count * sizeof(float));
partialSum = (float *)malloc(global[0]/local[0] * sizeof(float));

int i;
for (i = 0; i < count; i++){
    input[i] = (float)i+1;
}
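Because the input is filled with 1, 2, ..., count, the correct total has a closed form, count * (count + 1) / 2, which makes it easy to check the device result. A small sketch, with expected_sum as a hypothetical helper name:

```c
#include <assert.h>

/* Hypothetical helper: expected total for input[i] = i + 1,
 * i = 0 .. count-1, i.e. 1 + 2 + ... + count.
 * Computed in long long so the intermediate product does not overflow int. */
long long expected_sum(int count)
{
    assert(count >= 0);
    return (long long)count * ((long long)count + 1) / 2;
}
```

For DATA_SIZE = 15000 this gives 112507500, the value the host-side total should be compared against.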

I'm thinking it has something to do with the size of the input not being a power of two? I noticed the results begin to go off for input sizes of around 8000 and beyond. Any assistance is welcome. Thanks.

I'm thinking it has something to do with the size of the input not being a power of two?

Yes. Consider what happens when you try to reduce, say, 9 elements. Suppose you launch one work-group of 9 work-items:

for (int step = workgroup_size / 2; step > 0; step /= 2){
    // At iteration 0: step = 9 / 2 = 4
    barrier(CLK_LOCAL_MEM_FENCE);

    if (local_id < step) {
        // Branch taken by threads 0 to 3
        // Only 8 numbers added up together! 
        localSum[local_id] += localSum[local_id + step];
    }
}

The 9th element (localSum[8]) is never added, hence the reduction is incorrect. An easy solution is to pad the input data with enough zeroes to bring the work-group size up to the next power of two.
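The effect can be reproduced on the host without OpenCL. The sketch below is a sequential C stand-in for the loop in the kernel, not the kernel itself: the inner loop plays the role of the work-items with local_id < step. The names reduce_group, next_pow2 and reduce_group_padded, and the MAX_GROUP limit, are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_GROUP 1024  /* assumed upper bound on group size for this sketch */

/* Sequential stand-in for the kernel's reduction over one work-group.
 * Within a step, reads (index >= step) and writes (index < step) never
 * overlap, so running the work-items one after another gives the same
 * result as running them in parallel between barriers. */
float reduce_group(const float *values, size_t workgroup_size)
{
    assert(workgroup_size > 0 && workgroup_size <= MAX_GROUP);

    float localSum[MAX_GROUP];
    memcpy(localSum, values, workgroup_size * sizeof(float));

    for (size_t step = workgroup_size / 2; step > 0; step /= 2) {
        for (size_t local_id = 0; local_id < step; ++local_id) {
            localSum[local_id] += localSum[local_id + step];
        }
    }
    return localSum[0];
}

/* Smallest power of two that is >= n (returns 1 for n <= 1). */
size_t next_pow2(size_t n)
{
    size_t p = 1;
    while (p < n) {
        p *= 2;
    }
    return p;
}

/* Same reduction, but on a zero-padded copy whose length is a power of two. */
float reduce_group_padded(const float *values, size_t n)
{
    size_t padded = next_pow2(n);
    assert(n > 0 && padded <= MAX_GROUP);

    float buf[MAX_GROUP] = {0.0f};           /* all elements start at zero */
    memcpy(buf, values, n * sizeof(float));  /* real data, zeroes after it */
    return reduce_group(buf, padded);
}
```

For the values 1 through 9, reduce_group returns 36 (the 9 is dropped), while reduce_group_padded pads to 16 elements and returns 45.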


 