[英]Can't make work Opencl Sum Reduction from AMD sample reduction when changing WorkGroups dimension
以下代码来自amd网站
__kernel
void reduce(__global float* buffer,
__local float* scratch,
__const int length,
__global float* result) {
int global_index = get_global_id(0);
float accumulator = INFINITY;
// Loop sequentially over chunks of input vector
while (global_index < length) {
float element = buffer[global_index];
accumulator = (accumulator < element) ? accumulator : element;
global_index += get_global_size(0);
}
// Perform parallel reduction
int local_index = get_local_id(0);
scratch[local_index] = accumulator;
barrier(CLK_LOCAL_MEM_FENCE);
for(int offset = get_local_size(0) / 2;
offset > 0;
offset = offset / 2) {
if (local_index < offset) {
float other = scratch[local_index + offset];
float mine = scratch[local_index];
scratch[local_index] = (mine < other) ? mine : other;
}
barrier(CLK_LOCAL_MEM_FENCE);
}
if (local_index == 0) {
result[get_group_id(0)] = scratch[0];
}
}
我对其进行了修改,以使其总和减少:
__kernel
void reduce(__global float* buffer,
__local float* scratch,
__const int length,
__global float* result) {
int global_index = get_global_id(0);
float accumulator = 0.0;
// Loop sequentially over chunks of input vector
while (global_index < length) {
float element = buffer[global_index];
accumulator = accumulator + element;
global_index += get_global_size(0);
}
// Perform parallel reduction
int local_index = get_local_id(0);
scratch[local_index] = accumulator;
barrier(CLK_LOCAL_MEM_FENCE);
for(int offset = get_local_size(0) / 2;
offset > 0;
offset = offset / 2) {
if (local_index < offset) {
float other = scratch[local_index + offset];
float mine = scratch[local_index];
scratch[local_index] = mine + other;
}
barrier(CLK_LOCAL_MEM_FENCE);
}
if (local_index == 0) {
result[get_group_id(0)] = scratch[0];
}
}
当我只使用一个工作组时,它的工作原理就像一种魅力(这意味着我将NULL
作为local_work_size
NULL
local_work_size
给clEnqueueNDRangeKernel()
),但是当我尝试更改工作组尺寸时,事情变得无法控制。 (我应该说我是OpenCl的新手)
我的工作如下
#define GLOBAL_DIM 600
#define WORK_DIM 60
size_t global_1D[3] = {GLOBAL_DIM,1,1};
size_t work_dim[3] = {WORK_DIM,1,1};
err = clEnqueueNDRangeKernel(commands, av_velocity_kernel, 1, NULL, global_1D, work_dim, 0, NULL, NULL); //TODO CHECK THIS LINE
if (err) {
printf("Error: Failed to execute av_velocity_kernel!\n"); printf("\n%s",err_code(err)); fflush(stdout); return EXIT_FAILURE; }
我做错了吗?
此外,我注意到,如果我设置#define GLOBAL_DIM 60000
(这是我需要的),则会用完本地内存。 如果我使用多个工作组,或者本地内存在工作组之间平均分配,我会得到“更多”的本地内存吗?
首先,只有工作组大小为2的幂时,这些归约内核才能正常工作。 这意味着您应该使用64而不是60。此外,更改GLOBAL_DIM也不会使您用尽本地内存:调用内核时您很可能在做错事。
声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.