
CUDA shared memory - sum reduction from kernel

I am working on big datasets that are image cubes (450x450x1500). I have a kernel that works on individual data elements. Each data element produces 6 intermediate results (floats). My block consists of 1024 threads, and each thread stores its 6 intermediate results in shared memory (6 float arrays). However, I now need to add up the intermediate results across the block to produce 6 sum values. I do not have enough global memory to save these 6 float arrays to global memory and then run a reduction from Thrust or any other library in host code.

Are there any reduction routines that can be called from inside a kernel function on arrays in shared memory?

What will be the best way to solve this problem? I am a newbie to CUDA programming and would welcome any suggestions.

This seems unlikely:

I do not have enough global memory to save these 6 float arrays to global memory and then run a reduction from Thrust or any other library in host code.

I can't imagine how you have enough space to store your data in shared memory but not in global memory.

Anyway, CUB provides reduction routines that can be called from within a threadblock, and that can operate on data stored in shared memory.
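For illustration, here is a minimal sketch of calling CUB's BlockReduce from inside a kernel. The 1024-thread block size from the question is assumed, and the input array is illustrative; in the real kernel each thread would feed in one of its 6 intermediate results instead of a load:

```cuda
#include <cub/cub.cuh>

constexpr int BLOCK_THREADS = 1024;  // matches the question's block size

__global__ void kernel(const float *in, float *block_sums)
{
    // Collective block-wide reduction; temp_storage lives in shared memory.
    using BlockReduce = cub::BlockReduce<float, BLOCK_THREADS>;
    __shared__ typename BlockReduce::TempStorage temp_storage;

    // One value per thread (here a global-memory load, for illustration).
    float thread_val = in[blockIdx.x * BLOCK_THREADS + threadIdx.x];

    // The returned aggregate is valid in thread 0 only.
    float sum = BlockReduce(temp_storage).Sum(thread_val);

    if (threadIdx.x == 0)
        block_sums[blockIdx.x] = sum;
}
```

To reduce all 6 intermediate values, call Sum() six times, inserting a __syncthreads() before each reuse of temp_storage.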

Or you can write your own sum-reduction code. It's not terribly hard to do; there are many questions on SO about it, such as this one.

Or you could adapt the CUDA reduction sample code.
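For reference, here is a minimal sketch of the classic shared-memory tree reduction those sources describe, assuming blockDim.x is a power of two (it is here, at 1024). You would call it once per shared array:

```cuda
// In-place tree reduction over a shared array of blockDim.x floats,
// one element per thread, already populated by the caller.
// Destroys the array's contents; the total ends up in sdata[0].
__device__ float block_sum(float *sdata)
{
    __syncthreads();  // make every thread's value visible
    for (unsigned s = blockDim.x / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s)
            sdata[threadIdx.x] += sdata[threadIdx.x + s];
        __syncthreads();
    }
    return sdata[0];  // valid in all threads after the final barrier
}
```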

Update

After seeing all the comments, I understand that instead of doing one or a few reductions, you need to do 450x450x6 reductions.

In this case there's a simpler solution.

You don't need to implement a relatively complex parallel reduction for each 1500-D vector. Since you already have 450x450x6 vectors to reduce, you can reduce all of them in parallel, each with a traditional serial reduction.

You could use a block with 16x16 threads to process a particular region of the image, and a grid with 29x29 blocks to cover the whole 450x450 image.

In each thread, you iterate over the 1500 frames. In each iteration, you first compute the 6 intermediate results, then add them to the running sums. When you finish all the iterations, you write the 6 sums to global memory.

That finishes the kernel design, and no shared memory is needed.
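A sketch of that kernel, assuming the cube is laid out frame-major as [1500][450][450] and with compute_results() as a placeholder for the per-element computation from the question:

```cuda
#define W      450
#define H      450
#define FRAMES 1500

// Placeholder for the question's per-element computation (illustrative only).
__device__ void compute_results(float v, float r[6])
{
    for (int i = 0; i < 6; ++i) r[i] = v * (i + 1);
}

// One thread per pixel; each thread walks the 1500 frames serially and
// accumulates its 6 sums in registers. No shared memory involved.
__global__ void reduce_cube(const float *cube, float *sums /* [H][W][6] */)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= W || y >= H) return;  // threads outside the 450x450 image

    float acc[6] = {0, 0, 0, 0, 0, 0};
    for (int f = 0; f < FRAMES; ++f) {
        float v = cube[(size_t)f * W * H + (size_t)y * W + x];
        float r[6];
        compute_results(v, r);
        for (int i = 0; i < 6; ++i) acc[i] += r[i];
    }
    for (int i = 0; i < 6; ++i)
        sums[((size_t)y * W + x) * 6 + i] = acc[i];
}
```

Launched with dim3 block(16, 16); dim3 grid((W + 15) / 16, (H + 15) / 16); this gives the 29x29 grid mentioned above, with the bounds check handling the threads that fall outside the image.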

You will find that the performance is very good. Since it is a memory-bound operation, it won't take much longer than simply accessing all the image cube data once.

In case you don't have enough global memory for the whole cube, you could split it into 4 sub-cubes of [1500][225][225] and call the kernel on each sub-cube. The only thing you need to change is the grid size.
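A host-side sketch of that splitting, under the same assumptions as the kernel above. copy_subcube_to_device() and reduce_cube_sub() are hypothetical helpers: a strided host-to-device copy (e.g. built on cudaMemcpy2D) and a variant of the kernel that takes the quadrant origin:

```cuda
const int SUB = 225;  // quadrant edge: 450 / 2
dim3 block(16, 16);
dim3 grid((SUB + 15) / 16, (SUB + 15) / 16);

// Only one quadrant's worth of frames resides in global memory at a time.
for (int qy = 0; qy < 2; ++qy) {
    for (int qx = 0; qx < 2; ++qx) {
        copy_subcube_to_device(d_sub, h_cube, qx * SUB, qy * SUB, SUB);
        reduce_cube_sub<<<grid, block>>>(d_sub, d_sums, qx * SUB, qy * SUB);
    }
}
cudaDeviceSynchronize();
```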

Take a look at this; it thoroughly explains CUDA parallel reduction.

If I understand it correctly, each thread should sum up "only" 6 floats.

I'm not sure that doing this with a parallel reduction is worthwhile in general, in the sense that it will yield performance gains.

If you are targeting Kepler, you may try shuffle operations, provided you set the block size so that your intermediate results fit in the Streaming Multiprocessor's registers.
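On Kepler this meant __shfl_down(); current CUDA replaces it with the _sync variants. A minimal sketch of a warp-level sum via shuffles, with no shared memory required:

```cuda
// Each of the 32 lanes contributes one value; after the loop, lane 0
// holds the warp's total.
__device__ float warp_sum(float v)
{
    for (int offset = 16; offset > 0; offset >>= 1)
        v += __shfl_down_sync(0xffffffff, v, offset);
    return v;  // valid in lane 0
}
```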

As also pointed out by Robert Crovella, your statement about being unable to store the intermediate results in global memory seems strange, as the amount of global memory is certainly larger than the amount of shared memory.
