I am a total newbie to OpenCL.
I need to perform a reduction (sum operator) over a one-dimensional array of doubles.
I have been wandering around the net, but the examples I found are quite confusing. Can anyone post an easy-to-read (and possibly efficient) tutorial implementation?
Additional info:
- I have access to one GPU device
- I am using C for the kernel code
You mentioned your problem involves 60k doubles, which won't fit into the local memory of your device. I put together a kernel which reduces your vector to one partial sum per work group (10-30 or so values), which you can then sum in your host program. I am having trouble with doubles on my machine, but this kernel should work fine if you enable doubles and change 'float' to 'double' wherever it appears. I will debug the double problem I am having, and post an update.
params:
Usage:
Potential optimizations:
Make inVectorSize (and the vector) the largest multiple of (work-group size) * (number of work groups) that fits within your data, and call the kernel with only that much data, so the kernel splits the data evenly. Calculate the sum of any remaining data on the host while waiting for the callback (alternatively, build the same kernel for the CPU device and pass it only the remaining data). Start with this sum when adding up outVector in step #5 above. This optimization should keep the work groups evenly saturated throughout the computation.
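The split described above can be sketched on the host in plain C. This is only an illustration: `device_element_count` and `host_tail_sum` are hypothetical helper names, not part of OpenCL, and the work-group size and group count you pass in are whatever you chose for your own launch.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: largest element count <= n that is a multiple of
 * wg_size * num_groups, i.e. the part handed to the device. */
size_t device_element_count(size_t n, size_t wg_size, size_t num_groups)
{
    assert(wg_size > 0 && num_groups > 0);
    size_t stride = wg_size * num_groups;
    return (n / stride) * stride;
}

/* Hypothetical helper: sum the leftover elements [device_count, n)
 * on the host while the device works on the rest. */
double host_tail_sum(const double *data, size_t device_count, size_t n)
{
    double tail = 0.0;
    for (size_t i = device_count; i < n; i++)
        tail += data[i];
    return tail;
}
```

For example, with 60000 elements, a work-group size of 256 and 20 work groups (both numbers picked only for illustration), the device would get 56320 elements and the host would sum the remaining 3680. The value returned by `host_tail_sum` is then the starting total when the per-group results are added up.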
__kernel void floatSum(__global float* inVector, __global float* outVector, const int inVectorSize, __local float* resultScratch){
    int wid = get_local_id(0);       // index of this work item within its group
    int wsize = get_local_size(0);   // work items per group
    int grid = get_group_id(0);      // index of this work group
    int grcount = get_num_groups(0); // total number of work groups

    int i;
    // Elements per group, rounded up so the tail of the vector is not dropped.
    int workAmount = (inVectorSize + grcount - 1) / grcount;
    int startOffset = workAmount * grid + wid;
    int maxOffset = workAmount * (grid + 1);
    if(maxOffset > inVectorSize){
        maxOffset = inVectorSize;
    }

    // Each work item accumulates a strided slice of its group's chunk.
    resultScratch[wid] = 0.0f;
    for(i=startOffset;i<maxOffset;i+=wsize){
        resultScratch[wid] += inVector[i];
    }
    barrier(CLK_LOCAL_MEM_FENCE);

    // The first work item of every group (not only global id 0) writes that group's partial sum.
    if(wid == 0){
        for(i=1;i<wsize;i++){
            resultScratch[0] += resultScratch[i];
        }
        outVector[grid] = resultScratch[0];
    }
}
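If you want to check the index arithmetic without a GPU, the partitioning can be emulated on the CPU. The following is a rough sketch in plain C, not OpenCL host code: `emulate_float_sum` and `sum_partials` are hypothetical names, work items run one after another (so no barrier is needed), and it assumes the per-group chunk is rounded up and that work item 0 of each group writes that group's partial sum.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical CPU emulation of the reduction kernel's partitioning.
 * out must hold num_groups floats, scratch must hold wg_size floats. */
void emulate_float_sum(const float *in, float *out, int n,
                       int num_groups, int wg_size, float *scratch)
{
    assert(num_groups > 0 && wg_size > 0);
    /* Assumption: ceiling division, so the last group picks up the tail. */
    int work_amount = (n + num_groups - 1) / num_groups;

    for (int grid = 0; grid < num_groups; grid++) {
        int max_offset = work_amount * (grid + 1);
        if (max_offset > n)
            max_offset = n;

        /* Phase 1: each work item accumulates a strided slice. */
        for (int wid = 0; wid < wg_size; wid++) {
            scratch[wid] = 0.0f;
            for (int i = work_amount * grid + wid; i < max_offset; i += wg_size)
                scratch[wid] += in[i];
        }

        /* Phase 2: work item 0 folds the scratch buffer into one value. */
        for (int i = 1; i < wg_size; i++)
            scratch[0] += scratch[i];
        out[grid] = scratch[0];
    }
}

/* Final host step: add up the per-group partial sums. */
float sum_partials(const float *partials, int count)
{
    float total = 0.0f;
    for (int i = 0; i < count; i++)
        total += partials[i];
    return total;
}
```

Calling `emulate_float_sum` with a small input and then `sum_partials` on the output mirrors the two stages of the real computation: one partial sum per work group on the device, then a short final loop on the host.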
Also, enabling doubles:
#ifdef cl_khr_fp64
#pragma OPENCL EXTENSION cl_khr_fp64 : enable
#else
#ifdef cl_amd_fp64
#pragma OPENCL EXTENSION cl_amd_fp64 : enable
#endif
#endif
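The pragma only helps if the device actually advertises one of these extensions. On the host, the space-separated extension list that `clGetDeviceInfo` returns for `CL_DEVICE_EXTENSIONS` can be searched for a whole-token match before building the double version. A minimal sketch in plain C of just the string check; `has_extension` and `pick_fp64_extension` are hypothetical helpers, not OpenCL functions.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: returns 1 if `name` appears as a whole
 * space-separated token in `extensions`, 0 otherwise. */
int has_extension(const char *extensions, const char *name)
{
    assert(extensions != NULL && name != NULL);
    size_t len = strlen(name);
    if (len == 0)
        return 0;

    const char *p = extensions;
    while ((p = strstr(p, name)) != NULL) {
        int starts_token = (p == extensions) || (p[-1] == ' ');
        int ends_token = (p[len] == '\0') || (p[len] == ' ');
        if (starts_token && ends_token)
            return 1;
        p++; /* partial match only; keep searching */
    }
    return 0;
}

/* Mirrors the #ifdef chain: prefer the Khronos extension, fall back
 * to the AMD one, and return NULL if neither is advertised. */
const char *pick_fp64_extension(const char *extensions)
{
    if (has_extension(extensions, "cl_khr_fp64"))
        return "cl_khr_fp64";
    if (has_extension(extensions, "cl_amd_fp64"))
        return "cl_amd_fp64";
    return NULL;
}
```

If `pick_fp64_extension` returns NULL for your device's extension string, the float version of the kernel is the one to build.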
Update: AMD APP KernelAnalyzer got an update (v12), and it shows that the double-precision version of this kernel is, in fact, ALU-bound on the 5870 and 6970 cards.