I've been conducting research on streaming datasets larger than the memory available on the GPU to the device for basic computations. One of the main limitations is the fact that the PCIe bus is generally limited around 8GB/s, and kernel fusion can help reuse data that can be reused and that it can exploit shared memory and locality within the GPU. Most research papers I have found are very difficult to understand and most of them implement fusion in complex applications such as https://ieeexplore.ieee.org/document/6270615 . I've read many papers and they ALL FAIL TO EXPLAIN some simple steps to fuse two kernels together.
My question is how does fusion actually work? . What are the steps one would go through to change a normal kernel to a fused kernel? Also, is it necessary to have more than one kernel in order to fuse it, as fusing is just a fancy term for eliminating some memory bound issues, and exploiting locality and shared memory.
I need to understand how kernel fusion is used for a basic CUDA program, like matrix multiplication, or addition and subtraction kernels. A really simple example (The code is not correct but should give an idea) like:
int *device_A;
int *device_B;
int *device_C;
cudaMalloc(device_A,sizeof(int)*N);
cudaMemcpyAsync(device_A,host_A, N*sizeof(int),HostToDevice,stream);
KernelAdd<<<block,thread,stream>>>(device_A,device_B); //put result in C
KernelSubtract<<<block,thread,stream>>>(device_C);
cudaMemcpyAsync(host_C,device_C, N*sizeof(int),DeviceToHost,stream); //send final result through the PCIe to the CPU
The basic idea behind kernel fusion is that 2 or more kernels will be converted into 1 kernel. The operations are combined. Initially it may not be obvious what the benefit is. But it can provide two related kinds of benefits:
Let's use an example like yours, where we have an Add kernel and a multiply kernel, and assume each kernel works on a vector, and each thread does the following:
This operation requires one read per thread and one write per thread. If we did both of them back-to-back, the sequence of operations would look like:
Add kernel:
Multiply kernel:
We can see that step 3 in the first kernel and step 1 in the second kernel are doing things that aren't really necessary to achieve the final result, but they are necessary due to the design of these (independent) kernels. There is no way for one kernel to pass results to another kernel except via global memory.
But if we combine the two kernels together, we could write a kernel like this:
This fused kernel does both operations, produces the same result, but instead of 2 global memory load operations and 2 global memory store operations, it only requires 1 of each.
This savings can be very significant for memory-bound operations (like these) on the GPU. By reducing the number of loads and stores required, the overall performance is improved, usually proportional to the reduction in number of load/store operations.
The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.