
double sum reduction opencl tutorial

I am a total newbie to OpenCL.

I need to perform a reduction (sum operator) over a one-dimensional array of doubles.

I have been wandering around the net, but the examples I found are quite confusing. Can anyone post an easy-to-read (and possibly efficient) tutorial implementation?

Additional info:

  • I have access to one GPU device
  • I am using C for the kernel code

You mentioned your problem involves 60k doubles, which won't fit into the local memory of your device. I put together a kernel which will reduce your vector down to 10-30 or so values, which you can sum in your host program. I am having trouble with doubles on my machine, but this kernel should work fine if you enable doubles and change 'float' to 'double' wherever it appears. I will debug the double problem I am having and post an update.

Params:

  • global float* inVector - the source of floats to sum
  • global float* outVector - a list of floats, one for each work group
  • const int inVectorSize - the total number of floats held by inVector
  • local float* resultScratch - local memory for each work group to use. You need to allocate one float per work item in the group; expected size = sizeof(cl_float)*get_local_size(0). For example, if you use 64 work items per group, this will be 64 floats = 256 bytes. Switching to doubles will make it 512 bytes. The minimum LDS size defined by the OpenCL specification is 16kb. See this question for more info about passing in a local (NULL) parameter.

Usage:

  1. Allocate memory for the input and output buffers.
  2. Create a work group for each compute unit you have on the device.
  3. Determine the optimal work group size, and use this to calculate the size of 'resultScratch'.
  4. Call the kernel, and read outVector back to the host.
  5. Loop through your copy of outVector and add the values to get your final sum.

Potential optimizations:

  1. As usual, you want to call the kernel with a lot of data. Too little data is not really worth the transfer and setup time.
  2. Make inVectorSize (and the vector) the largest multiple of (workgroup size) * (number of work groups) that fits. Call the kernel with only this amount of data; the kernel splits it up evenly. Calculate the sum of any remaining data on the host while waiting for the callback (alternatively, build the same kernel for the CPU device and pass it only the remaining data). Start with this sum when adding up outVector in step #5 above. This optimization should keep the work groups evenly saturated throughout the computation.

    __kernel void floatSum(__global float* inVector, __global float* outVector,
                           const int inVectorSize, __local float* resultScratch) {
        int gid = get_global_id(0);
        int wid = get_local_id(0);
        int wsize = get_local_size(0);
        int grid = get_group_id(0);
        int grcount = get_num_groups(0);

        int i;
        int workAmount = inVectorSize / grcount;
        int startOffset = workAmount * grid + wid;
        int maxOffset = workAmount * (grid + 1);
        if (maxOffset > inVectorSize) {
            maxOffset = inVectorSize;
        }

        // each work item sums a strided slice of this group's chunk
        resultScratch[wid] = 0.0f;
        for (i = startOffset; i < maxOffset; i += wsize) {
            resultScratch[wid] += inVector[i];
        }
        barrier(CLK_LOCAL_MEM_FENCE);

        // the first work item of each group folds the partials and
        // writes the group's result (wid, not gid, so that every
        // group writes its entry of outVector)
        if (wid == 0) {
            for (i = 1; i < wsize; i++) {
                resultScratch[0] += resultScratch[i];
            }
            outVector[grid] = resultScratch[0];
        }
    }

Also, enabling doubles:

#ifdef cl_khr_fp64
#pragma OPENCL EXTENSION cl_khr_fp64 : enable
#else
#ifdef cl_amd_fp64
#pragma OPENCL EXTENSION cl_amd_fp64 : enable
#endif
#endif

Update: AMD APP KernelAnalyzer got an update (v12), and it shows that the double precision version of this kernel is in fact ALU bound on the 5870 and 6970 cards.
