
CUDA shared memory - sum reduction from kernel

I am working on big datasets that are image cubes (450x450x1500). I have a kernel that works on individual data elements. Each data element produces 6 intermediate results (floats). My block consists of 1024 threads, and each thread stores its 6 intermediate results in shared memory (6 float arrays). However, I now need to add up the intermediate results across the block to produce 6 sum values. I do not have enough global memory to save these 6 float arrays to global memory and then run a reduction from Thrust or any other library in host code.

Are there any reduction routines that can be called from inside a kernel function on arrays in shared memory?

What will be the best way to solve this problem? I am a newbie to CUDA programming and would welcome any suggestions.

This seems unlikely:

I do not have enough global memory to save these 6 float arrays to global memory and then run a reduction from Thrust or any other library in host code.

I can't imagine how you have enough space to store your data in shared memory but not in global memory.

Anyway, CUB provides reduction routines that can be called from within a threadblock, and that can operate on data stored in shared memory.
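For illustration, here is a minimal sketch of calling CUB's BlockReduce from inside a kernel. The 1024-thread block size from the question is assumed, and the input array is illustrative; in the real kernel each thread would feed in one of its 6 intermediate results instead of a load:

```cuda
#include <cub/cub.cuh>

constexpr int BLOCK_THREADS = 1024;  // matches the question's block size

__global__ void kernel(const float *in, float *block_sums)
{
    // Collective block-wide reduction; temp_storage lives in shared memory.
    using BlockReduce = cub::BlockReduce<float, BLOCK_THREADS>;
    __shared__ typename BlockReduce::TempStorage temp_storage;

    // One value per thread (here a global-memory load, for illustration).
    float thread_val = in[blockIdx.x * BLOCK_THREADS + threadIdx.x];

    // The returned aggregate is valid in thread 0 only.
    float sum = BlockReduce(temp_storage).Sum(thread_val);

    if (threadIdx.x == 0)
        block_sums[blockIdx.x] = sum;
}
```

To reduce all 6 intermediate values, call Sum() six times, inserting a __syncthreads() before each reuse of temp_storage.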

Or you can write your own sum-reduction code. It's not terribly hard to do; there are many questions on SO about it, such as this one.

Or you could adapt the CUDA reduction sample code.
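For reference, here is a minimal sketch of the classic shared-memory tree reduction those sources describe, assuming blockDim.x is a power of two (it is here, at 1024). You would call it once per shared array:

```cuda
// In-place tree reduction over a shared array of blockDim.x floats,
// one element per thread, already populated by the caller.
// Destroys the array's contents; the total ends up in sdata[0].
__device__ float block_sum(float *sdata)
{
    __syncthreads();  // make every thread's value visible
    for (unsigned s = blockDim.x / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s)
            sdata[threadIdx.x] += sdata[threadIdx.x + s];
        __syncthreads();
    }
    return sdata[0];  // valid in all threads after the final barrier
}
```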

Update

After seeing all the comments, I understand that instead of doing one or a few reductions, you need to do 450x450x6 reductions.

In this case there's a simpler solution.

You don't need to implement a relatively complex parallel reduction for each 1500-D vector. Since you already have 450x450x6 vectors to reduce, you can reduce all of them in parallel, each with a traditional serial reduction.

You could use a block with 16x16 threads to process a particular region of the image, and a grid with 29x29 blocks to cover the whole 450x450 image.

In each thread, you iterate over the 1500 frames. In each iteration, you first compute the 6 intermediate results, then add them to the running sums. When you finish all the iterations, you write the 6 sums to global memory.

That finishes the kernel design, and no shared memory is needed.
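A sketch of that kernel, assuming the cube is laid out frame-major as [1500][450][450] and with compute_results() as a placeholder for the per-element computation from the question:

```cuda
#define W      450
#define H      450
#define FRAMES 1500

// Placeholder for the question's per-element computation (illustrative only).
__device__ void compute_results(float v, float r[6])
{
    for (int i = 0; i < 6; ++i) r[i] = v * (i + 1);
}

// One thread per pixel; each thread walks the 1500 frames serially and
// accumulates its 6 sums in registers. No shared memory involved.
__global__ void reduce_cube(const float *cube, float *sums /* [H][W][6] */)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= W || y >= H) return;  // threads outside the 450x450 image

    float acc[6] = {0, 0, 0, 0, 0, 0};
    for (int f = 0; f < FRAMES; ++f) {
        float v = cube[(size_t)f * W * H + (size_t)y * W + x];
        float r[6];
        compute_results(v, r);
        for (int i = 0; i < 6; ++i) acc[i] += r[i];
    }
    for (int i = 0; i < 6; ++i)
        sums[((size_t)y * W + x) * 6 + i] = acc[i];
}
```

Launched with dim3 block(16, 16); dim3 grid((W + 15) / 16, (H + 15) / 16); this gives the 29x29 grid mentioned above, with the bounds check handling the threads that fall outside the image.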

You will find that the performance is very good. Since it is a memory-bound operation, it won't take much longer than simply accessing all the image cube data once.

In case you don't have enough global memory for the whole cube, you could split it into 4 sub-cubes of [1500][225][225] and call the kernel on each sub-cube. The only thing you need to change is the grid size.
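A host-side sketch of that splitting, under the same assumptions as the kernel above. copy_subcube_to_device() and reduce_cube_sub() are hypothetical helpers: a strided host-to-device copy (e.g. built on cudaMemcpy2D) and a variant of the kernel that takes the quadrant origin:

```cuda
const int SUB = 225;  // quadrant edge: 450 / 2
dim3 block(16, 16);
dim3 grid((SUB + 15) / 16, (SUB + 15) / 16);

// Only one quadrant's worth of frames resides in global memory at a time.
for (int qy = 0; qy < 2; ++qy) {
    for (int qx = 0; qx < 2; ++qx) {
        copy_subcube_to_device(d_sub, h_cube, qx * SUB, qy * SUB, SUB);
        reduce_cube_sub<<<grid, block>>>(d_sub, d_sums, qx * SUB, qy * SUB);
    }
}
cudaDeviceSynchronize();
```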

Take a look at this; it thoroughly explains CUDA parallel reduction.

If I understand it correctly, each thread should sum up "only" 6 floats.

I'm not sure that doing this with a parallel reduction is worthwhile in general, in the sense that it will yield performance gains.

If you are targeting Kepler, you may try shuffle operations, provided you set the block size so that your intermediate results fit in the Streaming Multiprocessor's registers.
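On Kepler this meant __shfl_down(); current CUDA replaces it with the _sync variants. A minimal sketch of a warp-level sum via shuffles, with no shared memory required:

```cuda
// Each of the 32 lanes contributes one value; after the loop, lane 0
// holds the warp's total.
__device__ float warp_sum(float v)
{
    for (int offset = 16; offset > 0; offset >>= 1)
        v += __shfl_down_sync(0xffffffff, v, offset);
    return v;  // valid in lane 0
}
```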

As also pointed out by Robert Crovella, your statement about being unable to store the intermediate results in global memory seems strange, as the amount of global memory is certainly larger than the amount of shared memory.
