
CUDA: How Does Kernel Fusion Improve Performance on Memory-Bound Applications on the GPU?

I've been conducting research on streaming datasets larger than the memory available on the GPU to the device for basic computations. One of the main limitations is that the PCIe bus is generally limited to around 8 GB/s; kernel fusion can help by reusing data already on the GPU and by exploiting shared memory and locality within the device. Most research papers I have found are very difficult to understand, and most of them implement fusion in complex applications such as https://ieeexplore.ieee.org/document/6270615 . I've read many papers, and they all fail to explain the simple steps needed to fuse two kernels together.

My question is: how does fusion actually work? What are the steps one would go through to change a normal kernel into a fused kernel? Also, is it necessary to have more than one kernel in order to fuse, or is fusing just a fancy term for eliminating some memory-bound issues and exploiting locality and shared memory?

I need to understand how kernel fusion is used in a basic CUDA program, such as matrix multiplication or addition and subtraction kernels. A really simple example (the code is not correct, but it should give the idea):

int *device_A;
int *device_B;
int *device_C;

cudaMalloc(&device_A, sizeof(int)*N);
cudaMalloc(&device_B, sizeof(int)*N);
cudaMalloc(&device_C, sizeof(int)*N);

cudaMemcpyAsync(device_A, host_A, N*sizeof(int), cudaMemcpyHostToDevice, stream);
cudaMemcpyAsync(device_B, host_B, N*sizeof(int), cudaMemcpyHostToDevice, stream);

KernelAdd<<<blocks, threads, 0, stream>>>(device_A, device_B, device_C);   // put result in C
KernelSubtract<<<blocks, threads, 0, stream>>>(device_C);

cudaMemcpyAsync(host_C, device_C, N*sizeof(int), cudaMemcpyDeviceToHost, stream); // send final result through the PCIe to the CPU

The basic idea behind kernel fusion is that 2 or more kernels will be converted into 1 kernel. The operations are combined. Initially it may not be obvious what the benefit is. But it can provide two related kinds of benefits:

  1. by reusing the data that a kernel may have populated either in registers or shared memory
  2. by reducing (i.e., eliminating) "redundant" loads and stores

Let's use an example like yours, where we have an Add kernel and a Multiply kernel, and assume each kernel works on a vector, and each thread does the following:

  1. Load my element of vector A from global memory
  2. Add a constant to, or multiply by a constant, my vector element
  3. Store my element back out to vector A (in global memory)

This operation requires one read per thread and one write per thread. If we did both of them back-to-back, the sequence of operations would look like:

Add kernel:

  1. Load my element of vector A from global memory
  2. Add a value to my vector element
  3. Store my element back out to vector A (in global memory)

Multiply kernel:

  1. Load my element of vector A from global memory
  2. Multiply my vector element by a value
  3. Store my element back out to vector A (in global memory)
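
For concreteness, a minimal sketch of what two such unfused kernels could look like (the names and signatures here are hypothetical, not taken from the question):

__global__ void KernelAdd(int *A, int addVal, int N)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < N)
        A[i] = A[i] + addVal;   // load A[i] from global memory, add, store back to global memory
}

__global__ void KernelMultiply(int *A, int mulVal, int N)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < N)
        A[i] = A[i] * mulVal;   // load A[i] from global memory again, multiply, store back again
}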

We can see that step 3 in the first kernel and step 1 in the second kernel are doing things that aren't really necessary to achieve the final result, but they are necessary due to the design of these (independent) kernels. There is no way for one kernel to pass results to another kernel except via global memory.

But if we combine the two kernels together, we could write a kernel like this:

  1. Load my element of vector A from global memory
  2. Add a value to my vector element
  3. Multiply my vector element by a value
  4. Store my element back out to vector A (in global memory)

This fused kernel does both operations, produces the same result, but instead of 2 global memory load operations and 2 global memory store operations, it only requires 1 of each.
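
A sketch of the fused kernel, under the same assumptions: the intermediate value stays in a register between the two operations, so the extra store/load round trip through global memory disappears.

__global__ void KernelAddMultiply(int *A, int addVal, int mulVal, int N)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < N) {
        int val = A[i];          // 1. one load from global memory
        val = val + addVal;      // 2. add, value held in a register
        val = val * mulVal;      // 3. multiply, value still in the register
        A[i] = val;              // 4. one store back to global memory
    }
}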

This savings can be very significant for memory-bound operations (like these) on the GPU. By reducing the number of loads and stores required, the overall performance is improved, usually proportional to the reduction in number of load/store operations.
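
On the host side, the change is simply to replace the two launches with a single launch of the fused kernel (again a sketch, using the hypothetical names from above):

// before fusion: vector A makes two round trips through global memory
KernelAdd<<<blocks, threads, 0, stream>>>(device_A, addVal, N);
KernelMultiply<<<blocks, threads, 0, stream>>>(device_A, mulVal, N);

// after fusion: one launch, one round trip through global memory
KernelAddMultiply<<<blocks, threads, 0, stream>>>(device_A, addVal, mulVal, N);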
