
Can't make OpenCL sum reduction from the AMD sample reduction work when changing the work-group dimension

The following code comes from the AMD website:

__kernel
void reduce(__global float* buffer,
            __local float* scratch,
            __const int length,
            __global float* result) {

  int global_index = get_global_id(0);
  float accumulator = INFINITY;
  // Loop sequentially over chunks of input vector
  while (global_index < length) {
    float element = buffer[global_index];
    accumulator = (accumulator < element) ? accumulator : element;
    global_index += get_global_size(0);
  }

  // Perform parallel reduction
  int local_index = get_local_id(0);
  scratch[local_index] = accumulator;
  barrier(CLK_LOCAL_MEM_FENCE);
  for(int offset = get_local_size(0) / 2;
      offset > 0;
      offset = offset / 2) {
    if (local_index < offset) {
      float other = scratch[local_index + offset];
      float mine = scratch[local_index];
      scratch[local_index] = (mine < other) ? mine : other;
    }
    barrier(CLK_LOCAL_MEM_FENCE);
  }
  if (local_index == 0) {
     result[get_group_id(0)] = scratch[0];
  }
}

I adapted it to make it work as a sum reduction:

__kernel
void reduce(__global float* buffer,
            __local float* scratch,
            __const int length,
            __global float* result) {

  int global_index = get_global_id(0);
  float accumulator = 0.0;
  // Loop sequentially over chunks of input vector
  while (global_index < length) {
    float element = buffer[global_index];
    accumulator = accumulator + element;
    global_index += get_global_size(0);
  }

  // Perform parallel reduction
  int local_index = get_local_id(0);
  scratch[local_index] = accumulator;
  barrier(CLK_LOCAL_MEM_FENCE);
  for(int offset = get_local_size(0) / 2;
      offset > 0;
      offset = offset / 2) {
    if (local_index < offset) {
      float other = scratch[local_index + offset];
      float mine = scratch[local_index];
      scratch[local_index] = mine + other;
    }
    barrier(CLK_LOCAL_MEM_FENCE);
  }
  if (local_index == 0) {
     result[get_group_id(0)] = scratch[0];
  }
}

It works like a charm when I use only one work group (meaning that I pass NULL as local_work_size to clEnqueueNDRangeKernel()), but things get out of my control when I try to change the work-group dimension. (I should say I am a total newbie in OpenCL.)

What I do is as follows:

#define GLOBAL_DIM 600
#define WORK_DIM 60

size_t global_1D[3] = {GLOBAL_DIM,1,1};
size_t work_dim[3] = {WORK_DIM,1,1};
err = clEnqueueNDRangeKernel(commands, av_velocity_kernel, 1, NULL, global_1D, work_dim, 0, NULL, NULL); //TODO CHECK THIS LINE
if (err) {
  printf("Error: Failed to execute av_velocity_kernel!\n");
  printf("\n%s", err_code(err));
  fflush(stdout);
  return EXIT_FAILURE;
}

Am I doing it the wrong way?

Furthermore, I noticed that if I set #define GLOBAL_DIM 60000 (which is what I would need) I run out of local memory. Do I get "more" local memory if I use several work groups, or is the local memory evenly spread between work groups?

First of all, those reduction kernels only work correctly if the work-group size is a power of two: the loop halves offset with integer division, so with 60 work-items it goes 30, 15, 7, 3, 1 and some entries of scratch are never added in. This means that instead of 60 you should use something like 64 (and, on OpenCL 1.x, a global size that is a multiple of it, such as 640 instead of 600). Also, changing GLOBAL_DIM cannot by itself make you run out of local memory: you're most probably doing something wrong when invoking the kernel.
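To see why the work-group size matters, here is a rough host-side emulation of the sum kernel in plain C. It is only a sketch for illustration, not OpenCL code: the names tree_reduce and emulate_sum are made up for this example, the work-items of a group are run one after another instead of in parallel, and the final addition of the per-group results (result[get_group_id(0)] in the kernel) is assumed to be done on the host.

```c
#include <assert.h>
#include <stdlib.h>

/* Emulates the parallel reduction loop of one work-group.
   scratch holds one partial sum per work-item (hypothetical helper). */
static float tree_reduce(float *scratch, size_t local_size) {
    for (size_t offset = local_size / 2; offset > 0; offset /= 2) {
        /* Each "work-item" i < offset adds in its partner at i + offset. */
        for (size_t i = 0; i < offset; ++i) {
            scratch[i] += scratch[i + offset];
        }
    }
    return scratch[0];
}

/* Emulates the whole kernel launch plus a host-side sum of the
   per-group results (hypothetical helper, not an OpenCL API). */
static float emulate_sum(const float *buffer, size_t length,
                         size_t global_size, size_t local_size) {
    /* OpenCL 1.x requires the global size to be a multiple of the local size. */
    assert(local_size > 0 && global_size % local_size == 0);

    size_t num_groups = global_size / local_size;
    /* The scratch size depends only on local_size, never on global_size. */
    float *scratch = malloc(local_size * sizeof *scratch);
    assert(scratch != NULL);

    float total = 0.0f;
    for (size_t group = 0; group < num_groups; ++group) {
        for (size_t local = 0; local < local_size; ++local) {
            /* Sequential loop over chunks of the input, as in the kernel. */
            float accumulator = 0.0f;
            for (size_t idx = group * local_size + local; idx < length;
                 idx += global_size) {
                accumulator += buffer[idx];
            }
            scratch[local] = accumulator;
        }
        /* One partial result per work-group, summed here by the host. */
        total += tree_reduce(scratch, local_size);
    }
    free(scratch);
    return total;
}
```

With a buffer of 600 ones, emulate_sum(buffer, 600, 600, 60) gives 320 because each group of 60 drops part of its scratch array, while emulate_sum(buffer, 600, 640, 64) gives the expected 600. The emulation also shows that the scratch buffer holds local_size floats whatever the global size is, so on the host the __local argument would be sized as clSetKernelArg(kernel, 1, local_size * sizeof(float), NULL), where kernel and local_size stand for your own variables.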
