
double sum reduction opencl tutorial

I am a total newbie to OpenCL.

I need to operate a reduction (sum operator) over a one dimensional array of doubles.

I have been wandering around the net, but the examples I found are quite confusing. Can anyone post an easy-to-read (and possibly efficient) tutorial implementation?

Additional info:

  • I have access to one GPU device
  • I am using C for the kernel code

You mentioned your problem involves 60k doubles, which won't fit into local memory of your device. I put together a kernel which will reduce your vector down to 10-30 or so values, which you can sum with your host program. I am having trouble with doubles on my machine, but this kernel should work fine if you enable doubles and change 'float' to 'double' where you find it. I will debug the double problem I am having, and post an update.

params:

  • global float* inVector - the source of floats to sum
  • global float* outVector - a list of floats, one for each work group
  • const int inVectorSize - the total number of floats held by inVector
  • local float* resultScratch - local memory for each work group to use. You need to allocate one float per work item in the group; the expected size is sizeof(cl_float) * get_local_size(0). For example, if you use 64 work items per group, this is 64 floats = 256 bytes. Switching to doubles makes it 512 bytes. The minimum local memory size required by the OpenCL specification is 16 KB. See this related question for more info about passing in a local (NULL) parameter.

Usage:

  1. Allocate memory for the input and output buffers.
  2. Create a work group for each compute unit you have on the device.
  3. Determine the optimal work group size, and use this to calculate the size of 'resultScratch'.
  4. Call the kernel, then read outVector back to the host.
  5. Loop through your copy of outVector and add to get your final sum.

Potential optimizations:

  1. As usual, you want to call the kernel with a lot of data. Too little data is not really worth the transfer and setup time.
  2. Make inVectorSize (and the vector) the largest multiple of (workgroup size) * (number of work groups) that does not exceed your data size, and call the kernel with only this amount of data; the kernel splits it up evenly. Calculate the sum of any remaining data on the host while waiting for the callback (alternatively, build the same kernel for the CPU device and pass it only the remaining data). Start with this sum when adding outVector in step #5 above. This optimization should keep the work groups evenly saturated throughout the computation.

    __kernel void floatSum(__global float* inVector,
                           __global float* outVector,
                           const int inVectorSize,
                           __local float* resultScratch) {
        int wid = get_local_id(0);
        int wsize = get_local_size(0);
        int grid = get_group_id(0);
        int grcount = get_num_groups(0);
        int i;

        int workAmount = inVectorSize / grcount;
        int startOffset = workAmount * grid + wid;
        int maxOffset = workAmount * (grid + 1);
        if (maxOffset > inVectorSize) {
            maxOffset = inVectorSize;
        }

        /* Phase 1: each work item accumulates a strided slice of its
         * group's chunk into its own scratch slot. */
        resultScratch[wid] = 0.0f;
        for (i = startOffset; i < maxOffset; i += wsize) {
            resultScratch[wid] += inVector[i];
        }
        barrier(CLK_LOCAL_MEM_FENCE);

        /* Phase 2: work item 0 of each group folds the scratch slots
         * together and writes the group's partial sum. */
        if (wid == 0) {
            for (i = 1; i < wsize; i++) {
                resultScratch[0] += resultScratch[i];
            }
            outVector[grid] = resultScratch[0];
        }
    }

Also, enabling doubles:

#ifdef cl_khr_fp64
#pragma OPENCL EXTENSION cl_khr_fp64 : enable
#else
#ifdef cl_amd_fp64
#pragma OPENCL EXTENSION cl_amd_fp64 : enable
#endif
#endif

Update: AMD APP KernelAnalyzer got an update (v12), and it shows that the double-precision version of this kernel is in fact ALU-bound on the 5870 and 6970 cards.

