Parallel Reduction not working correctly

Question

I have the following parallel reduction kernel written in OpenCL. I just want to sum all the values in the BlockSum array. Using work_group_reduce_add(BlockSum[GetIndex]); works perfectly, but the optimized code I took from https://www.fz-juelich.de/SharedDocs/Downloads/IAS/JSC/EN/slides/opencl/opencl-05-reduction.pdf?__blob=publicationFile (slide 11) does not work correctly. What seems to be the error here? The global_work_size is set to {16,16}, and so is the local_work_size (meaning 256 threads in total for each work-group). With work_group_reduce_add I get 255, which is correct, but with the optimized code I get 0.

__kernel void Reduction()
{
        unsigned char GetThreadX = get_local_id(0); //it takes values from 0..15
        unsigned char GetThreadY = get_local_id(1); //it takes values from 0..15
        unsigned char GetGroup   = get_local_size(0); //16
        unsigned short  BlockSum[256];      
        int SumOfAll= 0;            
        
        unsigned short GetIndex = GetThreadX + (GetGroup * GetThreadY); // takes values 0..255, group=16        
        
        BlockSum[GetIndex] = 1;             
        barrier(CLK_LOCAL_MEM_FENCE);       
        
        SumOfAll= work_group_reduce_add(BlockSum[GetIndex]); //works great  
        
        // BUT CODE BELOW DOES NOT SUM CORRECTLY
        /*
        for(unsigned short stride=128; stride>1; stride >>= 1) {
            
            if(GetIndex < stride)
                BlockSum[GetIndex] += BlockSum[GetIndex + stride];          
            barrier(CLK_LOCAL_MEM_FENCE);           
        }               
        if(GetIndex==0)             
            SumOfAll = BlockSum[0] + BlockSum[1];       
        barrier(CLK_LOCAL_MEM_FENCE);
        */
        printf("SumOfAll=%d\n",SumOfAll);
}

Answer 1

OK, problem fixed. BlockSum[256] was not declared as __local, so it silently ended up in private memory (the default when the __local address space qualifier is omitted). That means every thread (work-item) had its own copy of the data, whereas the optimized reduction code expects the array to live in local memory shared among the threads of the work-group, so that it can sum up their values. The variable int SumOfAll; should also be declared either as __local with initialization or, as in my case, as private without any prior initialization. You choose.
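To make the difference concrete, here is a minimal host-side sketch in plain C (not OpenCL, and not the original poster's code) that replays the stride loop from the question twice: once with a single array shared by all emulated work-items, as with __local, and once with a separate array per work-item, as with private memory. The function name reduce_emulated and the zero-filled private copies are assumptions made for illustration; on a real device private memory is uninitialized, so the wrong value actually observed can differ.

```c
#include <string.h>

#define WORK_ITEMS 256

/* Replays the question's stride loop on the host.
   shared != 0: one BlockSum array for the whole group (like __local).
   shared == 0: every emulated work-item owns its own array (like private). */
static int reduce_emulated(int shared)
{
    static unsigned short block_sum[WORK_ITEMS][WORK_ITEMS];

    /* Assumption: zero-fill every copy. Real private memory is uninitialized. */
    memset(block_sum, 0, sizeof block_sum);

    /* BlockSum[GetIndex] = 1; executed once per work-item */
    for (int id = 0; id < WORK_ITEMS; ++id) {
        unsigned short *mine = shared ? block_sum[0] : block_sum[id];
        mine[id] = 1;
    }

    /* Finishing the inner loop before the next stride stands in for barrier() */
    for (int stride = 128; stride > 1; stride >>= 1) {
        for (int id = 0; id < stride; ++id) {
            unsigned short *mine = shared ? block_sum[0] : block_sum[id];
            mine[id] += mine[id + stride];
        }
    }

    /* What work-item 0 computes at the end of the kernel */
    return block_sum[0][0] + block_sum[0][1];
}
```

Calling reduce_emulated(1) adds up the contribution of every work-item, while reduce_emulated(0) only ever sees the single element that work-item 0 wrote into its own copy, which is why the reduction cannot work on a private array.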

I hope this kind of mistake helps someone who is as incautious as I was.

The working kernel now looks like this:

__kernel void Reduction()
{
        unsigned char GetThreadX = get_local_id(0); //it takes values from 0..15
        unsigned char GetThreadY = get_local_id(1); //it takes values from 0..15
        unsigned char GetGroup   = get_local_size(0); //16

        //**********************************************************
        // BlockSum must be __local; the missing qualifier was the root of the problem
        //**********************************************************
        __local unsigned short  BlockSum[256];      
        int SumOfAll;           
        //**********************************************************
        
        unsigned short GetIndex = GetThreadX + (GetGroup * GetThreadY); // takes values 0..255, group=16        
        
        BlockSum[GetIndex] = 1;             
        barrier(CLK_LOCAL_MEM_FENCE);       
        
        //SumOfAll = work_group_reduce_add(BlockSum[GetIndex]); 
        
        // OPTIMIZED CODE BELOW NOW SUMS CORRECTLY
        
        for(unsigned short stride=128; stride>1; stride >>= 1) {
            
            if(GetIndex < stride)
                BlockSum[GetIndex] += BlockSum[GetIndex + stride];          
            barrier(CLK_LOCAL_MEM_FENCE);           
        }               
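        if(GetIndex==0) {
            // only work-item 0 computes the final sum, so only it prints;
            // in every other work-item SumOfAll would still be uninitialized
            SumOfAll = BlockSum[0] + BlockSum[1];
            printf("SumOfAll=%d\n",SumOfAll);
        }
}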
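The stride loop in the kernel stops at stride > 1 and adds the last two partial sums by hand. As a hedged variant (not taken from the answer or the slides), the same halving scheme can run down to stride 1 so that the total ends up in element 0. The sketch below emulates this on the host in plain C. The name tree_reduce is hypothetical, and the sketch assumes the element count is a power of two and at least 1.

```c
#include <stddef.h>

/* Host-side emulation of a work-group tree reduction.
   Assumption: n is a power of two and n >= 1. The input array is overwritten. */
static unsigned int tree_reduce(unsigned int *data, size_t n)
{
    for (size_t stride = n / 2; stride > 0; stride >>= 1) {
        /* On a device each id would be one work-item, and a
           barrier(CLK_LOCAL_MEM_FENCE) would separate the passes. */
        for (size_t id = 0; id < stride; ++id)
            data[id] += data[id + stride];
    }
    return data[0];  /* after the last pass element 0 holds the total */
}
```

Within one pass the elements that are read (index stride and above) are never the ones being written (index below stride), which is why a single barrier per pass is enough in the kernel version.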

Answer 1 posted 2021-11-12 08:33:09