
OpenCL float sum reduction

I would like to apply a reduction to this piece of my kernel code (1-dimensional data):

__local float sum = 0;
int i;
for(i = 0; i < length; i++)
  sum += //some operation depending on i here;

Instead of having just 1 thread that performs this operation, I would like to have n threads (with n = length) and, at the end, have 1 thread make the total sum.

In pseudo code, I would like to be able to write something like this:

int i = get_global_id(0);
__local float sum = 0;
sum += //some operation depending on i here;
barrier(CLK_LOCAL_MEM_FENCE);
if(i == 0)
  res = sum;

Is there a way?

I have a race condition on sum.

To get you started you could do something like the example below (see Scarpino). Here we also take advantage of vector processing by using the OpenCL float4 data type.

Keep in mind that the kernel below returns a number of partial sums back to the host: one for each local work-group. This means that you will have to carry out the final sum by adding up all the partial sums back on the host. This is because (at least with OpenCL 1.2) there is no barrier function that synchronizes work-items in different work-groups.

If summing the partial sums on the host is undesirable, you can get around this by launching multiple kernels. This introduces some kernel-call overhead, but in some applications the extra penalty is acceptable or insignificant. To do this with the example below, you will need to modify your host code to call the kernel repeatedly and include logic to stop executing the kernel once the number of output vectors falls below the local size (details left to you, or check the Scarpino reference).

EDIT: Added an extra kernel argument for the output. Added a dot product to sum over the float4 vectors.

__kernel void reduction_vector(__global float4* data, __local float4* partial_sums, __global float* output)
{
    int lid = get_local_id(0);
    int group_size = get_local_size(0);
    partial_sums[lid] = data[get_global_id(0)];
    barrier(CLK_LOCAL_MEM_FENCE);

    for(int i = group_size/2; i>0; i >>= 1) {
        if(lid < i) {
            partial_sums[lid] += partial_sums[lid + i];
        }
        barrier(CLK_LOCAL_MEM_FENCE);
    }

    if(lid == 0) {
        output[get_group_id(0)] = dot(partial_sums[0], (float4)(1.0f));
    }
}
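If you do finish the reduction on the host, the remaining step is tiny. Here is a hedged Java/JOCL sketch (matching the host code further down this page); commandQueue, outputBuffer and numGroups are placeholder names for objects assumed to be set up elsewhere:

float[] partialSums = new float[numGroups];
// blocking read of one partial sum per work-group
clEnqueueReadBuffer(commandQueue, outputBuffer, CL_TRUE, 0,
        numGroups * Sizeof.cl_float, Pointer.to(partialSums), 0, null, null);
float total = 0.0f;
for (float s : partialSums) {
    total += s;
}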

I know this is a very old post, but from everything I've tried, the answer from Bruce doesn't work, and the one from Adam is inefficient due to both global memory use and kernel execution overhead.

The comment by Jordan on the answer from Bruce is correct that this algorithm breaks down in every iteration where the number of elements is not even. Yet it is essentially the same code that can be found in several search results.

I scratched my head over this for several days, partially hindered by the fact that my language of choice is not C/C++ based, and also because it is tricky, if not impossible, to debug on the GPU. Eventually, though, I found an answer that worked.

This is a combination of the answer by Bruce and the one from Adam. It copies the source from global memory into local memory, then reduces by repeatedly folding the top half onto the bottom half, until there is no data left.

The result is a buffer containing the same number of items as there are work-groups in use (so that very large reductions can be broken down), which must then be summed by the CPU; alternatively, call it from another kernel and do this last step on the GPU.

This part is a little over my head, but I believe this code also avoids bank conflict issues by reading from local memory essentially sequentially. Would love confirmation on that from anyone who knows.

Note: The global 'AOffset' parameter can be omitted from the source if your data begins at offset zero. Simply remove it from the kernel prototype and from the fourth line of code, where it is used as part of an array index...

__kernel void Sum(__global float * A, __global float *output, ulong AOffset, __local float * target ) {
        const size_t globalId = get_global_id(0);
        const size_t localId = get_local_id(0);
        target[localId] = A[globalId+AOffset];

        barrier(CLK_LOCAL_MEM_FENCE);
        size_t blockSize = get_local_size(0);
        size_t halfBlockSize = blockSize / 2;
        while (halfBlockSize>0) {
            if (localId<halfBlockSize) {
                target[localId] += target[localId + halfBlockSize];
                if ((halfBlockSize*2)<blockSize) { // uneven block division
                    if (localId==0) { // when localID==0
                        target[localId] += target[localId + (blockSize-1)];
                    }
                }
            }
            barrier(CLK_LOCAL_MEM_FENCE);
            blockSize = halfBlockSize;
            halfBlockSize = blockSize / 2;
        }
        if (localId==0) {
            output[get_group_id(0)] = target[0];
        }
    }
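One host-side detail worth calling out: the __local target argument has no buffer object behind it, so you only pass its size in bytes together with a null pointer. A minimal JOCL sketch, assuming kernelSum, ABuffer, outputBuffer and workGroupSize are placeholder names for objects created elsewhere:

clSetKernelArg(kernelSum, 0, Sizeof.cl_mem, Pointer.to(ABuffer));
clSetKernelArg(kernelSum, 1, Sizeof.cl_mem, Pointer.to(outputBuffer));
clSetKernelArg(kernelSum, 2, Sizeof.cl_ulong, Pointer.to(new long[]{0}));
// __local argument: size in bytes only, no data pointer
clSetKernelArg(kernelSum, 3, Sizeof.cl_float * workGroupSize, null);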

https://pastebin.com/xN4yQ28N

A simple and fast way to reduce data is by repeatedly folding the top half of the data into the bottom half.

For example, take the following ridiculously simple CL code:

__kernel void foldKernel(__global float *arVal, int offset) {
    int gid = get_global_id(0);
    arVal[gid] = arVal[gid]+arVal[gid+offset];
}

With the following Java/JOCL host code (or port it to C++ etc.):

    int t = totalDataSize;
    while (t > 1) {
        int m = t / 2;
        int n = (t + 1) / 2;
        clSetKernelArg(kernelFold, 0, Sizeof.cl_mem, Pointer.to(arVal));
        clSetKernelArg(kernelFold, 1, Sizeof.cl_int, Pointer.to(new int[]{n}));
        cl_event evFold = new cl_event();
        clEnqueueNDRangeKernel(commandQueue, kernelFold, 1, null, new long[]{m}, null, 0, null, evFold);
        clWaitForEvents(1, new cl_event[]{evFold});
        t = n;
    }

The host code loops log2(n) times, so it finishes quickly even with huge arrays. The fiddling with "m" and "n" handles non-power-of-two arrays: for t = 7, for example, m = 3 and n = 4, so work-items 0..2 add elements 4..6 onto elements 0..2, element 3 is left untouched, and the next pass runs with t = 4.

  • Easy for OpenCL to parallelize well on any GPU platform (i.e. fast).
  • Low memory use, because it works in place.
  • Works efficiently with non-power-of-two data sizes.
  • Flexible, e.g. you can change the kernel to do "min" instead of "+" (see the sketch below).
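To illustrate that last point, a hypothetical minimum-reduction variant (foldMinKernel is a made-up name) would only change the combining operation:

__kernel void foldMinKernel(__global float *arVal, int offset) {
    int gid = get_global_id(0);
    // same folding pattern, but keep the smaller of the two values
    arVal[gid] = fmin(arVal[gid], arVal[gid + offset]);
}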

If you support OpenCL C 2.0 features, you can use the new work_group_reduce_add() function to sum-reduce within a single work-group.
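For instance, a minimal sketch of such a kernel (still producing one partial sum per work-group, which the host or a follow-up kernel then combines) might look like this; note that the program has to be built with -cl-std=CL2.0 or newer:

__kernel void reduce_wg(__global const float* data, __global float* output) {
    // every work-item contributes one element; the built-in handles the rest
    float sum = work_group_reduce_add(data[get_global_id(0)]);
    if (get_local_id(0) == 0)
        output[get_group_id(0)] = sum;
}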
