
Parallel reduction using local memory in OpenCL

I implemented a reduce kernel in OpenCL to sum up all entries in an input vector of size N. For easier testing I initialize the input vector with 1.0f, so the result should be N. But it is not!

Here is my reduce kernel:

kernel void reduce(global float* input, global float* output, const unsigned int N, local float* cache)
{
    const uint local_id = get_local_id(0);
    const uint global_id = get_global_id(0);
    const uint local_size = get_local_size(0);

    cache[local_id] = (global_id < N) ? input[global_id] : 0.0f;
    barrier(CLK_LOCAL_MEM_FENCE);

    for (unsigned int s = local_size >> 1; s > 0; s >>= 1) {
        if (local_id < s) {
            cache[local_id] += cache[local_id + s];
        }
        barrier(CLK_LOCAL_MEM_FENCE);
    }

    if (local_id == 0) output[local_size] = cache[0];
}

And here is the host-side setup for OpenCL:

 const uint N = 8196;

 cl_float a[N];
 cl_float b[N];

 for (uint i=0; i<N; i++) {
      a[i] = 1.0f;
      b[i] = 0.0f;
 }

 cl::Buffer inputBuffer(context, CL_MEM_WRITE_ONLY, sizeof(cl_float)*N);
 cl::Buffer resultBuffer(context, CL_MEM_READ_ONLY, sizeof(cl_float)*N);

 queue.enqueueWriteBuffer(inputBuffer, CL_TRUE, 0, sizeof(cl_float)*N, a);
 queue.enqueueWriteBuffer(resultBuffer, CL_TRUE, 0, sizeof(cl_float)*N, b);

 cl::Kernel addVectorKernel = cl::Kernel(program, "reduce");

 size_t localSize = addVectorKernel.getWorkGroupInfo<CL_KERNEL_WORK_GROUP_SIZE>(device); // e.g. => 512

 size_t globalSize = roundUp(localSize, N); // rounds up to a multiple of localSize

 addVectorKernel.setArg(0, inputBuffer);
 addVectorKernel.setArg(1, resultBuffer);
 addVectorKernel.setArg(2, N);
 addVectorKernel.setArg(3, (sizeof(cl_float) * localSize), NULL);


 queue.enqueueNDRangeKernel(
      addVectorKernel,
      cl::NullRange,    
      cl::NDRange(globalSize), 
      cl::NDRange(localSize)     
 );
 queue.finish(); // wait for ending

 queue.enqueueReadBuffer(resultBuffer, CL_TRUE, 0, sizeof(cl_float)*N, b); // e.g. => 1024
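The roundUp helper isn't shown above; a minimal version (an assumption inferred from the comment, not the actual implementation) that rounds the global size up to the next multiple of the work-group size could look like:

```cpp
#include <cstddef>

// Hypothetical roundUp: returns the smallest multiple of groupSize
// that is >= globalSize, so every work group is fully populated.
size_t roundUp(size_t groupSize, size_t globalSize) {
    size_t remainder = globalSize % groupSize;
    return remainder == 0 ? globalSize : globalSize + groupSize - remainder;
}
```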

The result depends on the workgroup size. What am I doing wrong? Is it the kernel itself or is it the settings for OpenCL?

You should be using the group's id when writing the sum back to global memory.

if (local_id == 0) output[local_size] = cache[0];

That line will write to output[512] repeatedly. You need each work group to write to a dedicated location in the output.

kernel void reduce(global float* input, global float* output, const unsigned int N, local float* cache)
{
    const uint local_id = get_local_id(0);
    const uint global_id = get_global_id(0);
    const uint group_id = get_group_id(0);
    const uint local_size = get_local_size(0);

    cache[local_id] = (global_id < N) ? input[global_id] : 0.0f;
    barrier(CLK_LOCAL_MEM_FENCE);

    for (unsigned int s = local_size >> 1; s > 0; s >>= 1) {
        if (local_id < s) {
            cache[local_id] += cache[local_id + s];
        }
        barrier(CLK_LOCAL_MEM_FENCE);
    }

    if (local_id == 0) output[group_id] = cache[0];
}

Then you need to sum the values from the output on the host. Note that 'b' in the host code does not need to hold N elements; only one element per work group will be used.

//replace (globalSize/localSize) with the pre-calculated/known number of work groups
for (size_t i = 1; i < (globalSize/localSize); i++) {
    b[0] += b[i];
}

Now b[0] is your grand total.
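The whole two-stage scheme can be simulated in plain C++ without OpenCL (the helper below is illustrative, not part of the original code): stage 1 mimics each work group producing one partial sum, stage 2 is the final host-side loop above.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Simulates the two-stage reduction: one partial sum per "work group",
// then a host-side pass that folds the partial sums into element 0.
float reduceTwoStage(const std::vector<float>& input, size_t localSize) {
    size_t numGroups = (input.size() + localSize - 1) / localSize;
    std::vector<float> partial(numGroups, 0.0f);
    // Stage 1: each group sums its slice of the input (as the kernel does).
    for (size_t g = 0; g < numGroups; ++g) {
        size_t end = std::min((g + 1) * localSize, input.size());
        for (size_t i = g * localSize; i < end; ++i)
            partial[g] += input[i];
    }
    // Stage 2: the host sums the per-group results.
    for (size_t g = 1; g < numGroups; ++g)
        partial[0] += partial[g];
    return partial[0];
}
```

With an input of N ones and localSize 512 this returns N, which is exactly what the device version should produce once each group writes to output[group_id].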

In the reduction for loop, you need this:

for(unsigned int s = localSize >> 1; s > 0; s >>= 1)

You are shifting one more bit than you should when initializing s. 与初始化s相比,您的位移要大得多。

After that's fixed, let's look at what your kernel is doing. The host code executes it with a globalSize of 8192 and a localSize of 512, which results in 16 work groups. Inside the kernel you first sum the data from the two consecutive memory locations at index 2*global_id. For the work group with id 15, work item 0, that will be at indices 15*512*2 = 15,360 and 15,361, which is outside the boundaries of your input array. I am surprised you don't get a crash. At the same time, this explains why you see double the values you expect.
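The index arithmetic is easy to check directly; the helper below (hypothetical, just reproducing the 2*global_id access pattern with the sizes from the host code) shows the first load of group 15 landing past the end of an 8192-element input:

```cpp
// Reproduces the 2*global_id indexing described above.
unsigned firstLoadIndex(unsigned groupId, unsigned localId, unsigned localSize) {
    unsigned global_id = groupId * localSize + localId;
    return 2 * global_id; // first of the two consecutive loads
}
```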

To fix it, you can do this:

cache[localID] = input[globalID];

Or specify a global size that's half of the current one.
