double sum reduction opencl tutorial
I am a total newbie to OpenCL. I need to perform a reduction (sum operator) over a one-dimensional array of doubles. I have been wandering around the net, but the examples I found are quite confusing. Can anyone post an easy-to-read (and possibly efficient) tutorial implementation?
Additional info:
- I have access to one GPU device;
- I am using C for the kernel code.
You mentioned your problem involves 60k doubles, which won't fit into the local memory of your device. I put together a kernel which will reduce your vector down to 10-30 or so values, which you can sum with your host program. I am having trouble with doubles on my machine, but this kernel should work fine if you enable doubles and change 'float' to 'double' where you find it. I will debug the double problem I am having, and post an update.
Params:
- inVector: the input vector of floats to reduce
- outVector: receives one partial sum per work group, so it needs (number of work groups) elements
- inVectorSize: the number of elements in inVector
- resultScratch: __local scratch buffer, one float per work item in a group

Usage:
Potential optimizations:
- Make inVectorSize (and the vector) the highest multiple of (workgroup size) * (number of work groups), and call the kernel with only this amount of data. The kernel splits up the data evenly.
- Calculate the sum of any remaining data on the host while waiting for the callback (alternatively, build the same kernel for the CPU device and pass it only the remaining data). Start with this sum when adding up outVector in step #5 above.

This optimization should keep the work groups evenly saturated throughout the computation.
__kernel void floatSum(__global float* inVector, __global float* outVector, const int inVectorSize, __local float* resultScratch){
    int wid = get_local_id(0);
    int wsize = get_local_size(0);
    int grid = get_group_id(0);
    int grcount = get_num_groups(0);

    int i;
    int workAmount = inVectorSize / grcount;
    int startOffset = workAmount * grid + wid;
    int maxOffset = workAmount * (grid + 1);
    if (maxOffset > inVectorSize) {
        maxOffset = inVectorSize;
    }

    // Each work item accumulates a strided partial sum into local memory.
    resultScratch[wid] = 0.0f;
    for (i = startOffset; i < maxOffset; i += wsize) {
        resultScratch[wid] += inVector[i];
    }
    barrier(CLK_LOCAL_MEM_FENCE);

    // The first work item of each group adds up the group's scratch
    // values and writes one partial sum per group to outVector.
    if (wid == 0) {
        for (i = 1; i < wsize; i++) {
            resultScratch[0] += resultScratch[i];
        }
        outVector[grid] = resultScratch[0];
    }
}
Also, enabling doubles:
#ifdef cl_khr_fp64
#pragma OPENCL EXTENSION cl_khr_fp64 : enable
#else
#ifdef cl_amd_fp64
#pragma OPENCL EXTENSION cl_amd_fp64 : enable
#endif
#endif
Update: AMD APP KernelAnalyzer got an update (v12), and it shows that the double-precision version of this kernel is in fact ALU bound on the 5870 and 6970 cards.