
Parallel sum reduction implementation in OpenCL

I am going through the NVIDIA sample code provided at link

In the sample kernel code (file oclReduction_kernel.c), reduce4 uses the following techniques:

1) Unrolling the last iterations and removing the synchronization barrier for thread IDs < 32.

2) Apart from this, the code uses blockSize checks to sum the data in local memory. In OpenCL we have get_local_size(0/1) to get the work-group size, so blockSize is confusing me.

I am not able to understand either of the points mentioned above. Why and how do these things help with optimization? Any explanation of reduce5 and reduce6 would be helpful as well.

That is pretty much explained on slides 21 and 22 of https://docs.nvidia.com/cuda/samples/6_Advanced/reduction/doc/reduction.pdf, which @Marco13 linked in the comments:

  • As reduction proceeds, # “active” threads decreases
    • When s <= 32, we have only one warp left
  • Instructions are SIMD synchronous within a warp.
  • That means when s <= 32:
    • We don't need to __syncthreads()
    • We don't need “if (tid < s)” because it doesn't save any work

Without unrolling, all warps execute every iteration of the for loop and if statement

And by https://www.pgroup.com/lit/articles/insider/v2n1a5.htm :

The code is actually executed in groups of 32 threads, what NVIDIA calls a warp.

Each core can execute a sequential thread, but the cores execute in what NVIDIA calls SIMT (Single Instruction, Multiple Thread) fashion; all cores in the same group execute the same instruction at the same time, much like classical SIMD processors.
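Put together, the warp-synchronous tail of reduce4 looks roughly like this in OpenCL C. This is a sketch adapted from the CUDA version in the slides, not the verbatim sample code; note that the OpenCL standard gives no warp-lockstep guarantee, so this pattern relies on NVIDIA-specific behavior, and `volatile` is needed to stop the compiler from caching local memory in registers:

```c
// Device-code sketch (needs an OpenCL host program to run).
__kernel void reduce4_sketch(__global const int *in,
                             __global int *out,
                             __local volatile int *sdata)
{
    unsigned int tid = get_local_id(0);
    sdata[tid] = in[get_global_id(0)];
    barrier(CLK_LOCAL_MEM_FENCE);

    // Tree reduction with barriers while more than one warp is active.
    for (unsigned int s = get_local_size(0) / 2; s > 32; s >>= 1) {
        if (tid < s)
            sdata[tid] += sdata[tid + s];
        barrier(CLK_LOCAL_MEM_FENCE);
    }

    // s <= 32: one warp left, SIMT lockstep on NVIDIA hardware,
    // so no barrier and no per-step "if (tid < s)" guard.
    if (tid < 32) {
        sdata[tid] += sdata[tid + 32];
        sdata[tid] += sdata[tid + 16];
        sdata[tid] += sdata[tid +  8];
        sdata[tid] += sdata[tid +  4];
        sdata[tid] += sdata[tid +  2];
        sdata[tid] += sdata[tid +  1];
    }

    if (tid == 0)
        out[get_group_id(0)] = sdata[0];
}
```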

Re 2): blockSize there appears to be the size of the work group (a "thread block" in CUDA terminology).
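The point of the blockSize checks in reduce5/reduce6 is that blockSize is a compile-time constant in the CUDA version (a template parameter), so every `if (blockSize >= N)` is resolved by the compiler and the whole loop disappears. The usual OpenCL analogue is a preprocessor define passed as a build option to clBuildProgram (e.g. `-DBLOCK_SIZE=128`). A sketch under that assumption (not the sample's actual code):

```c
// Built with, e.g., clBuildProgram(..., "-DBLOCK_SIZE=128", ...).
// Every "if (BLOCK_SIZE >= N)" below compares compile-time constants,
// so the compiler keeps only the needed lines: no loop, no runtime checks.
__kernel void reduce6_sketch(__global const int *in,
                             __global int *out,
                             __local volatile int *sdata)
{
    unsigned int tid = get_local_id(0);
    sdata[tid] = in[get_global_id(0)];
    barrier(CLK_LOCAL_MEM_FENCE);

    if (BLOCK_SIZE >= 512) { if (tid < 256) sdata[tid] += sdata[tid + 256]; barrier(CLK_LOCAL_MEM_FENCE); }
    if (BLOCK_SIZE >= 256) { if (tid < 128) sdata[tid] += sdata[tid + 128]; barrier(CLK_LOCAL_MEM_FENCE); }
    if (BLOCK_SIZE >= 128) { if (tid <  64) sdata[tid] += sdata[tid +  64]; barrier(CLK_LOCAL_MEM_FENCE); }

    if (tid < 32) {  // warp-synchronous tail, as in reduce4
        if (BLOCK_SIZE >= 64) sdata[tid] += sdata[tid + 32];
        if (BLOCK_SIZE >= 32) sdata[tid] += sdata[tid + 16];
        if (BLOCK_SIZE >= 16) sdata[tid] += sdata[tid +  8];
        if (BLOCK_SIZE >=  8) sdata[tid] += sdata[tid +  4];
        if (BLOCK_SIZE >=  4) sdata[tid] += sdata[tid +  2];
        if (BLOCK_SIZE >=  2) sdata[tid] += sdata[tid +  1];
    }

    if (tid == 0)
        out[get_group_id(0)] = sdata[0];
}
```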
