Using cl_float3 in the OpenCL parallel reduction example

I adapted the OpenCL parallel reduction example to work on an array of floats. Now I want to extend the code to handle cl_float3, that is, to find the minimum over an array of cl_float3. I thought it would be a straightforward change from float to float3 in the kernel, but I receive garbage values when I return from the kernel. Below are the kernels:

__kernel void pmin3(__global float3 *src,
                    __global float3 *gmin,
                    __local  float3 *lmin,
                    __global float  *dbg,
                    uint             nitems,
                    uint             dev)
{
    uint count  = nitems / get_global_size(0);
    uint idx    = (dev == 0) ? get_global_id(0) * count
                             : get_global_id(0);
    uint stride = (dev == 0) ? 1 : get_global_size(0);

    // Private min for the work-item
    float3 pmin = (float3)(pow(2.0,32.0)-1,pow(2.0,32.0)-1,pow(2.0,32.0)-1);

    for (int n = 0; n < count; n++, idx += stride) {
        pmin.x = min(pmin.x,src[idx].x);
        pmin.y = min(pmin.y,src[idx].y);
        pmin.z = min(pmin.z,src[idx].z);
    }

    // Reduce values within the work-group into local memory
    barrier(CLK_LOCAL_MEM_FENCE);
    if (get_local_id(0) == 0)
        lmin[0] = (float3)(pow(2.0,32.0)-1,pow(2.0,32.0)-1,pow(2.0,32.0)-1);
    for (int n = 0; n < get_local_size(0); n++) {
        barrier(CLK_LOCAL_MEM_FENCE);
        if (get_local_id(0) == n) {
            lmin[0].x = min(lmin[0].x,pmin.x);
            lmin[0].y = min(lmin[0].y,pmin.y);
            lmin[0].z = min(lmin[0].z,pmin.z);
        }
    }
    barrier(CLK_LOCAL_MEM_FENCE);

    // Write to __global gmin which will contain the work-group minima
    if (get_local_id(0) == 0)
        gmin[get_group_id(0)] = lmin[0];

    // Collect debug information
    if (get_global_id(0) == 0) {
        dbg[0] = get_num_groups(0);
        dbg[1] = get_global_size(0);
        dbg[2] = count;
        dbg[3] = stride;
    }
}

__kernel void min_reduce3(__global float3 *gmin)
{
    for (int n = 0; n < get_global_size(0); n++) {
        barrier(CLK_GLOBAL_MEM_FENCE);
        if (get_global_id(0) == n) {
            gmin[0].x = min(gmin[0].x,gmin[n].x);
            gmin[0].y = min(gmin[0].y,gmin[n].y);
            gmin[0].z = min(gmin[0].z,gmin[n].z);
        }
    }
    barrier(CLK_GLOBAL_MEM_FENCE);
}

I think the problem is with get_global_id(0) and get_global_size(), which report the entire size instead of only the number of rows. Any suggestions?

As others mentioned, float3 (and the other type3 types) behave as float4 (and the other type4 types) for the purposes of size and alignment. This can also be seen with the built-in vec_step function, which returns the number of elements in its argument's type, but returns 4 for type3 objects.
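To make the consequence concrete, here is a minimal plain-C sketch (not OpenCL code) of what happens when a kernel indexes a tightly packed host buffer through a float3 pointer. The helper name `read_as_device_float3` and the constant `DEVICE_FLOAT3_STRIDE` are made up for this illustration; the only fact taken from the answer is that each float3 element occupies the space of four floats.

```c
#include <assert.h>
#include <stddef.h>

/* Number of floats one float3 element occupies in device memory:
   3 components plus 1 padding slot, i.e. 16 bytes. */
#define DEVICE_FLOAT3_STRIDE 4

/* Hypothetical helper: emulates the addressing a kernel performs for
   src[idx].x / .y / .z on a float3 pointer. Element idx starts at
   float offset idx * 4, regardless of how the host packed the data. */
static void read_as_device_float3(const float *buf, size_t idx, float out[3])
{
    assert(buf != NULL);
    const float *elem = buf + idx * DEVICE_FLOAT3_STRIDE;
    out[0] = elem[0];
    out[1] = elem[1];
    out[2] = elem[2];
}
```

If the host fills the buffer as packed triples {1,2,3, 4,5,6, 7,8,9, ...}, element 1 is read back as (5, 6, 7) rather than (4, 5, 6), and the last elements are read from beyond the end of the packed data. That would be consistent with the garbage values described in the question.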

If your host code generates a packed float3 array (with each element taking the size and alignment of just 3 floats), then the proper way to use it from OpenCL is:

  • Use a float* parameter instead of float3*
  • Load the data using vload3
  • Store data using vstore3
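The indexing behind these steps can be sketched in plain C. In OpenCL C, vload3(offset, p) reads the three floats starting at p + offset * 3; the function `vload3_emulated` below is a hypothetical stand-in for that built-in, and `packed_min3` is an illustrative sequential version of the per-component minimum, not the parallel kernel itself.

```c
#include <assert.h>
#include <float.h>
#include <stddef.h>

/* Hypothetical plain-C stand-in for OpenCL's vload3(offset, p):
   reads the three floats that start at p + offset * 3. */
static void vload3_emulated(size_t offset, const float *p, float out[3])
{
    out[0] = p[offset * 3 + 0];
    out[1] = p[offset * 3 + 1];
    out[2] = p[offset * 3 + 2];
}

/* Component-wise minimum over n tightly packed 3-float elements.
   Sequential sketch of what each work-item would do for its slice. */
static void packed_min3(const float *src, size_t n, float result[3])
{
    assert(src != NULL && n > 0);
    result[0] = result[1] = result[2] = FLT_MAX;
    for (size_t i = 0; i < n; i++) {
        float v[3];
        vload3_emulated(i, src, v);
        for (int c = 0; c < 3; c++) {
            if (v[c] < result[c])
                result[c] = v[c];
        }
    }
}
```

In the kernel from the question, the equivalent change would be to declare src as __global float * and replace each src[idx] access with vload3(idx, src), so that the stride is three floats instead of four.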

float3 is 16-byte aligned. See section 6.1.5 of the OpenCL specification.
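An alternative to changing the kernel is to pad the data on the host so that each element occupies 16 bytes before it is copied to the device. The sketch below assumes a 4-byte float; the type `padded_float3` and the function `pad_float3_array` are hypothetical names for this illustration. In real host code the cl_float3 type from the OpenCL headers, which is defined as a cl_float4, would normally serve the same purpose.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical host-side mirror of the device float3 layout:
   three components plus one unused slot, 16 bytes in total
   (assuming a 4-byte float). */
typedef struct {
    float x, y, z, pad;
} padded_float3;

/* Expands n tightly packed 3-float elements into 16-byte elements,
   ready to be copied into a buffer the kernel reads as float3. */
static void pad_float3_array(const float *packed, size_t n, padded_float3 *out)
{
    assert(packed != NULL && out != NULL);
    for (size_t i = 0; i < n; i++) {
        out[i].x   = packed[i * 3 + 0];
        out[i].y   = packed[i * 3 + 1];
        out[i].z   = packed[i * 3 + 2];
        out[i].pad = 0.0f;  /* the kernel never reads this component */
    }
}
```

The trade-off is one third more memory and transfer volume in exchange for leaving the float3 kernel untouched. The device buffer then has to be sized as n * 16 bytes rather than n * 12.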
