OpenCL: sum in parallel of n integers

I need to write an OpenCL kernel function that uses a parallel algorithm to sum the n integers in an array numbers.

I should use an algorithm similar to the following:

parallel_summation(A):
    # ASSUME n = |A| is a power of 2 for simplicity

    # level 0:
    in parallel do:
      s[i] = A[i]                 for i = 0, 1, 2, ..., n-1

    # level 1:
    in parallel do:
      s[i] = s[i] + s[i+1]        for i = 0, 2, 4, ...

    # level 2:
    in parallel do:
      s[i] = s[i] + s[i+2]        for i = 0, 4, 8, ...

    # level 3:
    in parallel do:
      s[i] = s[i] + s[i+4]        for i = 0, 8, 16, ...

    # ...
    # level log_2( n ):
    s[0] = s[0] + s[n/2]

    return s[0]
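
For reference, the levels above can be written as a plain sequential C function. This is only a sketch under the same power-of-two assumption; the name tree_sum is hypothetical and does not come from the question. Each pass of the outer loop is one level, and the iterations of the inner loop are the additions that could run in parallel.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Sums s[0..n-1] in place, level by level, as in the pseudocode.
   Assumes n is a power of two; overwrites s and returns s[0].
   tree_sum is a hypothetical name used for illustration only. */
uint32_t tree_sum(uint32_t *s, size_t n)
{
    assert(n != 0 && (n & (n - 1)) == 0);  /* power-of-two precondition */

    /* offset doubles on every level: 1, 2, 4, ..., n/2 */
    for (size_t offset = 1; offset < n; offset *= 2) {
        /* the iterations of this loop do not depend on each other,
           so on a GPU each one could be handled by its own work-item */
        for (size_t i = 0; i + offset < n; i += 2 * offset) {
            s[i] += s[i + offset];
        }
    }
    return s[0];
}
```

With n = 16 the outer loop runs four times (offsets 1, 2, 4, 8), which matches log2(16) = 4 levels after the initial copy.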

So I came up with the following kernel code:

kernel void summation(global uint* numbers,
                      global uint* sum,
                      const  uint  n,
                      const  uint  work_group_size,
                      local  uint* work_group_buf,
                      const  uint  num_of_levels) {

    // lets assume for now that the workgroup's size is 16,
    // which is a power of 2.


    int i = get_global_id(0);

    if(i >= n)
        return;

    int local_i = get_local_id(0);

    uint step = 1;
    uint offset = 0;

    for(uint k = 0; k < num_of_levels; ++k) {

        if(k == 0) {

            work_group_buf[local_i] = numbers[i];

        }  else {

            if(local_i % step == 0) {
                work_group_buf[local_i] += work_group_buf[local_i + offset];
            }

        }

        if(offset == 0) {
            offset = 1;
        } else {
            offset *= 2;
        }

        step *= 2;

        barrier(CLK_LOCAL_MEM_FENCE);

    }

     atomic_add(sum, work_group_buf[0]);

}

But there is a bug, because I am not getting the expected results. numbers is a buffer that contains the numbers from 1 to n. num_of_levels is log2(number of work-items per work-group), which in my current example is 4 (log2(16)).

What am I doing wrong?

Note: I'm not getting any error; only the result is wrong. For example, I have an array of 1000000 elements holding the values 0 to 999999, and the sum of those elements should be 1783293664, but I'm getting 1349447424.
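
The expected value quoted here is consistent with 32-bit wraparound: the exact sum of 0 to 999999 is 499999500000, which does not fit in a uint, and its low 32 bits are 1783293664. A small C check of that arithmetic follows; the helper name wrapped_sum_below is an assumption made for this sketch.

```c
#include <assert.h>
#include <stdint.h>

/* Returns 0 + 1 + ... + (n-1) reduced modulo 2^32, i.e. the value a
   32-bit unsigned accumulator ends up holding after wrapping.
   wrapped_sum_below is a hypothetical helper, not part of the question. */
uint32_t wrapped_sum_below(uint64_t n)
{
    assert(n <= UINT32_MAX);           /* keeps n * (n - 1) within 64 bits */
    uint64_t exact = n * (n - 1) / 2;  /* closed form for the exact sum */
    return (uint32_t)exact;            /* conversion keeps the low 32 bits */
}
```

Calling wrapped_sum_below(1000000) gives the reference value to compare the kernel's output against.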

I fixed a few bugs. There were a few mistakes, and I was missing the final step s[0] = s[0] + s[n/2], as you can see in this new version:

kernel void summation(global uint* numbers,
                          global uint* sum,
                          const  uint  n,
                          local  uint* work_group_buf,
                          const  uint  num_of_levels) {

        const int i = get_global_id(0);
        const int local_i = get_local_id(0);

        private uint step = 2;
        private uint offset = 1;


        if(i < n)
            work_group_buf[local_i] = numbers[i];

        barrier(CLK_LOCAL_MEM_FENCE);

        for(uint k = 1; k < num_of_levels; ++k) {

            if((local_i % step) == 0) {
                work_group_buf[local_i] += work_group_buf[local_i + offset];
            }

            offset *= 2;
            step *= 2;

            barrier(CLK_LOCAL_MEM_FENCE);
        }

        work_group_buf[0] += work_group_buf[get_local_size(0) / 2];

        if(local_i == 0)
            atomic_add(sum, work_group_buf[0]);

}

Note that now only the work-item with local_i == 0 adds the first element of each work_group_buf (i.e. work_group_buf[0]) to the final sum, because that position contains the sum of all elements in the work-group.

This actually seems to work for work-groups of up to 32 work-items (where the size is a power of 2). In other words, this kernel seems to work only for work-group sizes of 2, 4, 8, 16 and 32.
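
A plausible explanation, offered as an assumption rather than a confirmed diagnosis: the final work_group_buf[0] += ... line is executed by every work-item instead of only local id 0, which is harmless only while the whole group runs in lockstep, and work-items with i >= n never initialize their slot. The C sketch below emulates what one work-group should compute when every level, including the last, is a barrier-separated step and out-of-range items contribute 0. The names group_partial_sum and emulated_sum are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Emulates one work-group of `size` work-items reducing the slice that
   starts at numbers[base]. buf plays the role of work_group_buf.
   Hypothetical host-side model, not an OpenCL kernel. */
uint32_t group_partial_sum(const uint32_t *numbers, size_t n,
                           size_t base, uint32_t *buf, size_t size)
{
    assert(size != 0 && (size & (size - 1)) == 0);  /* power-of-two group */

    /* load phase: items past the end of the input contribute 0 */
    for (size_t local_i = 0; local_i < size; ++local_i) {
        size_t i = base + local_i;
        buf[local_i] = (i < n) ? numbers[i] : 0u;
    }
    /* a barrier would go here in the kernel */

    /* all log2(size) levels run inside the loop, so no extra step follows */
    for (size_t offset = 1; offset < size; offset *= 2) {
        for (size_t local_i = 0; local_i < size; ++local_i) {
            if (local_i % (2 * offset) == 0)
                buf[local_i] += buf[local_i + offset];
        }
        /* a barrier would go here in the kernel */
    }
    return buf[0];  /* in the kernel, only local id 0 would publish this */
}

/* Adds up the partial sums of all groups; the += stands in for atomic_add. */
uint32_t emulated_sum(const uint32_t *numbers, size_t n,
                      uint32_t *buf, size_t size)
{
    uint32_t sum = 0;
    for (size_t base = 0; base < n; base += size)
        sum += group_partial_sum(numbers, n, base, buf, size);
    return sum;
}
```

The result does not depend on the group size in this model, which is the behaviour a corrected kernel would be expected to show for sizes above 32 as well.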

You can do it more simply. It is very fast, but it works on OpenCL 1.2+ only.

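// Adds v to *a using only atomic exchange operations.
inline void sum(__global int* a, int v)
{
    int s = v;  // amount this work-item still has to deposit
    int n = 0;
    int o = 0;
    do {
        n = s + atom_xchg(a, o);  // take the current total out (leaving 0) and add our share
        s = o + atom_xchg(a, n);  // store it back; s is whatever others deposited meanwhile
    }
    while (s != o);               // repeat until nothing was displaced
}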

__kernel void sum_kernel(__global int *set, __global int* out)
{
    int i = (get_group_id(0) + get_group_id(1)*get_num_groups(0)) * get_local_size(0) + get_local_id(0);

    sum(out, set[i]);
}
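
To see why the exchange loop accumulates correctly, here is a host-side analogue in C11, using atomic_exchange from <stdatomic.h> in place of atom_xchg. It is a sketch for illustration; the name xchg_add is hypothetical, and the example below exercises it from a single thread only.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Adds v to *a using only atomic exchange, mirroring the kernel helper.
   xchg_add is a hypothetical name used for this sketch. */
void xchg_add(atomic_int *a, int v)
{
    assert(a != NULL);
    int pending = v;  /* amount still to be deposited */
    while (pending != 0) {
        /* take the current total out, leaving 0 behind, and add our share */
        int total = pending + atomic_exchange(a, 0);
        /* store the new total; anything another thread deposited in the
           meantime comes back as the next pending amount */
        pending = atomic_exchange(a, total);
    }
}
```

When no other thread interferes, the second exchange returns 0 and the loop ends after one pass. Every call still goes through the same memory location, so updates from many work-items are serialized.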

From: GitHub - Hello GPU Compute World!

Thanks!
