
Method to perform the final sum of a reduction

This is a follow-up to my first question, explained on this link.

As a reminder, I would like to apply a method that can perform multiple sum reductions with OpenCL (my GPU device only supports OpenCL 1.2). I need to compute the sum reduction of an array to check the convergence criterion at each iteration of the main loop.

Currently, I have a version that does only one sum reduction (i.e. one iteration). In this version, and for simplicity, I used a sequential CPU loop to add up the partial sums and get the final value of the sum.
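
For reference, that sequential CPU fallback looks roughly like this (a sketch with hypothetical names; partialsBuffer is assumed to hold one partial sum per work group):

std::vector<float> partials(numWorkGroups);
clEnqueueReadBuffer(queue, partialsBuffer, CL_TRUE, 0,
    sizeof(cl_float) * numWorkGroups, partials.data(), 0, nullptr, nullptr);

float sum = 0.0f;
for (size_t i = 0; i < numWorkGroups; i++) {
    sum += partials[i];  // final summation done on the host
}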

Following your advice on my previous question, my issue is that I don't know how to perform the final sum by calling the NDRangeKernel function a second time (i.e. executing the kernel code a second time).

Indeed, with a second call, I face the same problem again when getting the sum of the partial sums (themselves computed by the first call of NDRangeKernel): it seems to be a recursive issue.

Let's take an example from the figure above: if the input array size is 10240000 and the WorkGroup size is 16, we get 10000*2^10/2^4 = 10000*2^6 = 640000 WorkGroups.

So after the first call, I get 640000 partial sums: how do I deal with the final summation of all these partial sums? If I call the kernel code another time with, for example, WorkGroup size = 16 and global size = 640000, I will get nWorkGroups = 640000/16 = 40000 partial sums, so I would have to call the kernel code one more time and repeat this process until nWorkGroups < WorkGroup size.
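
To illustrate, the repeated launches I have in mind would look roughly like this on the host (a minimal sketch; enqueueReduce is a hypothetical helper that sets the kernel arguments and calls clEnqueueNDRangeKernel):

size_t wgSize = 16;
size_t n = inputLength;            // e.g. 10240000
cl_mem src = inputBuffer;          // hypothetical buffer names
cl_mem dst = partialSumsBuffer;
while (n > 1) {
    size_t numGroups = (n + wgSize - 1) / wgSize;
    // One partial sum per work group is written to dst
    enqueueReduce(queue, kernel, src, dst, n, numGroups * wgSize, wgSize);
    // The output of this pass becomes the input of the next pass
    // (note: this ping-pong overwrites inputBuffer after the first pass)
    cl_mem tmp = src; src = dst; dst = tmp;
    n = numGroups;
}
// src now holds a single element: the final sum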

Maybe I didn't understand the second stage very well, especially this part of the kernel code from the "two-stage reduction" (on this link; I think this is the case of searching for the minimum of the input array):

__kernel
void reduce(__global float* buffer,
            __local float* scratch,
            const int length,
            __global float* result) {

  int global_index = get_global_id(0);
  float accumulator = INFINITY;
  // Loop sequentially over chunks of input vector
  while (global_index < length) {
    float element = buffer[global_index];
    accumulator = (accumulator < element) ? accumulator : element;
    global_index += get_global_size(0);
  }

  // Perform parallel reduction within the work group
  // (the remainder below is the standard local-memory tree reduction
  // from the linked article, elided as "..." in my quote)
  int local_index = get_local_id(0);
  scratch[local_index] = accumulator;
  barrier(CLK_LOCAL_MEM_FENCE);
  for (int offset = get_local_size(0) / 2; offset > 0; offset /= 2) {
    if (local_index < offset) {
      float other = scratch[local_index + offset];
      float mine = scratch[local_index];
      scratch[local_index] = (mine < other) ? mine : other;
    }
    barrier(CLK_LOCAL_MEM_FENCE);
  }
  // Work item 0 writes this work group's partial result
  if (local_index == 0) {
    result[get_group_id(0)] = scratch[0];
  }
}

Could someone explain what this kernel code snippet does?

Is there a relation to the second stage of the reduction, i.e. the final summation?

Feel free to ask for more details if I haven't made my issue clear.

Thanks

As mentioned in the comment: The statement

if input array size is 10240000 and WorkGroup size is 16, we get 10000*2^10/2^4 = 10000*2^6 = 640000 WorkGroups.

is not correct. You can choose an "arbitrary" work group size, and an "arbitrary" number of work groups. The numbers to choose here may be tailored to the target device. For example, the device may have a certain local memory size. This can be queried with clGetDeviceInfo:

cl_ulong localMemSize = 0;
clGetDeviceInfo(device, CL_DEVICE_LOCAL_MEM_SIZE, 
    sizeof(cl_ulong), &localMemSize, nullptr);

This may be used to compute the size of a local work group, considering the fact that each work group will require

sizeof(cl_float) * workGroupSize

bytes of local memory.
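
Continuing from the query above, one possible way to pick the work group size might look like this (a sketch; CL_DEVICE_MAX_WORK_GROUP_SIZE provides the device's own upper bound):

// Upper bound on the work group size imposed by the device
size_t maxWorkGroupSize = 0;
clGetDeviceInfo(device, CL_DEVICE_MAX_WORK_GROUP_SIZE,
    sizeof(size_t), &maxWorkGroupSize, nullptr);

// Halve until one float of scratch space per work item
// fits into the local memory queried above
size_t workGroupSize = maxWorkGroupSize;
while (sizeof(cl_float) * workGroupSize > localMemSize) {
    workGroupSize /= 2;
}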

Similarly, the number of work groups may be derived from other device-specific parameters.
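
For example, one common heuristic (an assumption here, not a fixed rule) is to use a small multiple of the number of compute units:

// Number of parallel compute units on the device
cl_uint computeUnits = 0;
clGetDeviceInfo(device, CL_DEVICE_MAX_COMPUTE_UNITS,
    sizeof(cl_uint), &computeUnits, nullptr);

// Hypothetical factor; tune for the target device
size_t numWorkGroups = computeUnits * 4;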


The key point regarding the reduction itself is that the work group size does not limit the size of the array that can be processed. I also had some difficulties with understanding the algorithm as a whole, so I tried to explain it here, hoping that a few images may be worth a thousand words:

[Figure: Reduction]

As you can see, the number of work groups and the work group size are fixed and independent of the input array length: Even though I'm using 3 work groups with a size of 8 in the example (giving a global size of 24), an array of length 64 can be processed. This is mainly due to the first loop, which just walks through the input array, with a "step size" that is equal to the global work size (24 here). The result will be one accumulated value for each of the 24 threads. These are then reduced in parallel.
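
Putting this together for the original question: with a fixed number of work groups, two kernel launches suffice. A minimal sketch, assuming the reduce kernel quoted above and hypothetical names (inputBuffer, partialsBuffer, resultBuffer, inputLength); note the quoted kernel computes a minimum, so for a sum the accumulator would start at 0.0f and the combine operation would be +:

// First pass: numWorkGroups * workGroupSize work items stride over the
// whole input array and produce numWorkGroups partial results
size_t globalSize = numWorkGroups * workGroupSize;
cl_int length = (cl_int) inputLength;
clSetKernelArg(kernel, 0, sizeof(cl_mem), &inputBuffer);
clSetKernelArg(kernel, 1, sizeof(cl_float) * workGroupSize, nullptr); // scratch
clSetKernelArg(kernel, 2, sizeof(cl_int), &length);
clSetKernelArg(kernel, 3, sizeof(cl_mem), &partialsBuffer);
clEnqueueNDRangeKernel(queue, kernel, 1, nullptr,
    &globalSize, &workGroupSize, 0, nullptr, nullptr);

// Second pass: a single work group (global size == work group size)
// reduces the numWorkGroups partial results to one value. The sequential
// loop in the kernel again covers the case numWorkGroups > workGroupSize,
// so no further passes are needed.
cl_int partialCount = (cl_int) numWorkGroups;
clSetKernelArg(kernel, 0, sizeof(cl_mem), &partialsBuffer);
clSetKernelArg(kernel, 1, sizeof(cl_float) * workGroupSize, nullptr);
clSetKernelArg(kernel, 2, sizeof(cl_int), &partialCount);
clSetKernelArg(kernel, 3, sizeof(cl_mem), &resultBuffer);
clEnqueueNDRangeKernel(queue, kernel, 1, nullptr,
    &workGroupSize, &workGroupSize, 0, nullptr, nullptr);

// resultBuffer[0] now holds the final reduced value

The second launch uses a global size equal to the work group size, i.e. exactly one work group, which is what makes it the final pass.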
