
Method to do final sum with reduction

This is a follow-up to my first issue, explained in this link.

As a reminder, I would like to apply a method that can perform multiple sum reductions with OpenCL (my GPU device only supports OpenCL 1.2). I need to compute the sum reduction of an array to check the convergence criterion at each iteration of the main loop.

Currently, I have a version that performs only one sum reduction (i.e. one iteration). In this version, for simplicity, I used a sequential CPU loop to compute the sum of the partial sums and get the final value.

Following your advice on my previous question, my issue is that I don't know how to perform the final sum by calling the NDRangeKernel function a second time (i.e. executing the kernel code a second time).

Indeed, with a second call I would always face the same problem of getting the sum of the partial sums (themselves computed by the first call of NDRangeKernel): it seems to be a recursive issue.

Let's take an example from the figure above: if the input array size is 10240000 and the WorkGroup size is 16, we get 10000*2^10/2^4 = 10000*2^6 = 640000 WorkGroups.

So after the first call, I get 640000 partial sums: how do I handle the final summation of all these partial sums? If I call the kernel code another time with, for example, WorkGroup size = 16 and global size = 640000, I will get nWorkGroups = 640000/16 = 40000 partial sums, so I have to call the kernel code one more time and repeat this process until nWorkGroups < WorkGroup size.

Maybe I didn't understand the second stage very well, mainly this part of the kernel code from the "two-stage reduction" (on this link; I think this is the case of searching for the minimum of the input array):

__kernel
void reduce(__global float* buffer,
            __local float* scratch,
            __const int length,
            __global float* result) {

  int global_index = get_global_id(0);
  float accumulator = INFINITY;
  // Loop sequentially over chunks of input vector
  while (global_index < length) {
    float element = buffer[global_index];
    accumulator = (accumulator < element) ? accumulator : element;
    global_index += get_global_size(0);
  }

  // Perform parallel reduction
  ...

Could someone explain what this kernel code snippet above does?

Is there a relation to the second stage of the reduction, i.e. the final summation?

Feel free to ask me for more details if my issue is not clear.

Thanks

As mentioned in the comment: The statement

if input array size is 10240000 and WorkGroup size is 16, we get 10000*2^10/2^4 = 10000*2^6 = 640000 WorkGroups.

is not correct. You can choose an "arbitrary" work group size and an "arbitrary" number of work groups. The numbers chosen here may be tailored to the target device. For example, the device may have a certain local memory size, which can be queried with clGetDeviceInfo:

cl_ulong localMemSize = 0;
clGetDeviceInfo(device, CL_DEVICE_LOCAL_MEM_SIZE, 
    sizeof(cl_ulong), &localMemSize, nullptr);

This may be used to compute the size of a local work group, considering that each work group will require

sizeof(cl_float) * workGroupSize

bytes of local memory.

Similarly, the number of work groups may be derived from other device-specific parameters.


The key point regarding the reduction itself is that the work group size does not limit the size of the array that can be processed. I also had some difficulties understanding the algorithm as a whole, so I tried to explain it here, hoping that a few images may be worth a thousand words:

[Figure: reduction of an array of length 64 using 3 work groups of size 8]

As you can see, the number of work groups and the work group size are fixed and independent of the input array length: even though I'm using 3 work groups with a size of 8 in the example (giving a global size of 24), an array of length 64 can be processed. This is mainly due to the first loop, which just walks through the input array with a "step size" equal to the global work size (24 here). The result is one accumulated value for each of the 24 threads. These are then reduced in parallel.
