
Parallel sum reduction implementation in OpenCL

I am going through the NVIDIA sample code provided at link

In the sample kernel code (file oclReduction_kernel.c), reduce4 uses the following techniques:

1) Unrolling the last iterations and removing the synchronization barrier for thread IDs < 32.

2) Apart from this, the code uses blockSize checks to sum the data in local memory. In OpenCL we have get_local_size(0/1) to get the work-group size, so blockSize is confusing me.

I am not able to understand either of the points mentioned above. Why and how do these things help with optimization? Any explanation of reduce5 and reduce6 would be helpful as well.

That is pretty much explained on slides 21 and 22 of https://docs.nvidia.com/cuda/samples/6_Advanced/reduction/doc/reduction.pdf, which @Marco13 linked in the comments:

  • As reduction proceeds, # “active” threads decreases
    • When s <= 32, we have only one warp left
  • Instructions are SIMD synchronous within a warp.
  • That means when s <= 32:
    • We don't need to __syncthreads()
    • We don't need “if (tid < s)” because it doesn't save any work

Without unrolling, all warps execute every iteration of the for loop and if statement

And by https://www.pgroup.com/lit/articles/insider/v2n1a5.htm :

The code is actually executed in groups of 32 threads, what NVIDIA calls a warp.

Each core can execute a sequential thread, but the cores execute in what NVIDIA calls SIMT (Single Instruction, Multiple Thread) fashion; all cores in the same group execute the same instruction at the same time, much like classical SIMD processors.
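Put together, the warp-synchronous tail of reduce4 looks roughly like this in OpenCL C. This is a sketch adapted from the CUDA version in the slides, not the verbatim sample code; note that the OpenCL standard gives no warp-lockstep guarantee, so this pattern relies on NVIDIA-specific behavior, and `volatile` is needed to stop the compiler from caching local memory in registers:

```c
// Device-code sketch (needs an OpenCL host program to run).
__kernel void reduce4_sketch(__global const int *in,
                             __global int *out,
                             __local volatile int *sdata)
{
    unsigned int tid = get_local_id(0);
    sdata[tid] = in[get_global_id(0)];
    barrier(CLK_LOCAL_MEM_FENCE);

    // Tree reduction with barriers while more than one warp is active.
    for (unsigned int s = get_local_size(0) / 2; s > 32; s >>= 1) {
        if (tid < s)
            sdata[tid] += sdata[tid + s];
        barrier(CLK_LOCAL_MEM_FENCE);
    }

    // s <= 32: one warp left, SIMT lockstep on NVIDIA hardware,
    // so no barrier and no per-step "if (tid < s)" guard.
    if (tid < 32) {
        sdata[tid] += sdata[tid + 32];
        sdata[tid] += sdata[tid + 16];
        sdata[tid] += sdata[tid +  8];
        sdata[tid] += sdata[tid +  4];
        sdata[tid] += sdata[tid +  2];
        sdata[tid] += sdata[tid +  1];
    }

    if (tid == 0)
        out[get_group_id(0)] = sdata[0];
}
```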

Re 2): blockSize there appears to be the size of the work group (a "thread block" in CUDA terminology).
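The point of the blockSize checks in reduce5/reduce6 is that blockSize is a compile-time constant in the CUDA version (a template parameter), so every `if (blockSize >= N)` is resolved by the compiler and the whole loop disappears. The usual OpenCL analogue is a preprocessor define passed as a build option to clBuildProgram (e.g. `-DBLOCK_SIZE=128`). A sketch under that assumption (not the sample's actual code):

```c
// Built with, e.g., clBuildProgram(..., "-DBLOCK_SIZE=128", ...).
// Every "if (BLOCK_SIZE >= N)" below compares compile-time constants,
// so the compiler keeps only the needed lines: no loop, no runtime checks.
__kernel void reduce6_sketch(__global const int *in,
                             __global int *out,
                             __local volatile int *sdata)
{
    unsigned int tid = get_local_id(0);
    sdata[tid] = in[get_global_id(0)];
    barrier(CLK_LOCAL_MEM_FENCE);

    if (BLOCK_SIZE >= 512) { if (tid < 256) sdata[tid] += sdata[tid + 256]; barrier(CLK_LOCAL_MEM_FENCE); }
    if (BLOCK_SIZE >= 256) { if (tid < 128) sdata[tid] += sdata[tid + 128]; barrier(CLK_LOCAL_MEM_FENCE); }
    if (BLOCK_SIZE >= 128) { if (tid <  64) sdata[tid] += sdata[tid +  64]; barrier(CLK_LOCAL_MEM_FENCE); }

    if (tid < 32) {  // warp-synchronous tail, as in reduce4
        if (BLOCK_SIZE >= 64) sdata[tid] += sdata[tid + 32];
        if (BLOCK_SIZE >= 32) sdata[tid] += sdata[tid + 16];
        if (BLOCK_SIZE >= 16) sdata[tid] += sdata[tid +  8];
        if (BLOCK_SIZE >=  8) sdata[tid] += sdata[tid +  4];
        if (BLOCK_SIZE >=  4) sdata[tid] += sdata[tid +  2];
        if (BLOCK_SIZE >=  2) sdata[tid] += sdata[tid +  1];
    }

    if (tid == 0)
        out[get_group_id(0)] = sdata[0];
}
```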
