
CUDA: Why memory transfers greater than 64KB are blocking calls?

I'm a student learning CUDA and I'm trying to understand how memory transfers work. I read on the Internet that memory transfers greater than 64 KB are treated as blocking calls, while transfers of 64 KB or less are non-blocking. I tried to explain this using my class notes, but I'm not sure my reasoning is correct. I suppose it's because transfers with a lot of data could cause idle times in which neither the GPU, the CPU, nor the memory bus is working, so it's better to overlap computation and memory transfers. Nevertheless, I don't understand why the limit is exactly 64 KB, and I'm not even sure what I just said is correct.

Can anyone help me or provide a better explanation? Thanks a lot.

EDIT: I first found this information in the following reference: https://developer.download.nvidia.com/CUDA/training/StreamsAndConcurrencyWebinar.pdf

The relevant slide is number 7, entitled "Default Stream (aka Stream '0')".

I then dug deeper and found this:

> As described by the CUDA C Programming Guide, asynchronous commands return control to the calling host thread before the device has finished the requested task (they are non-blocking). These commands are:
> • Kernel launches
> • Memory copies between two addresses within the same device memory
> • Memory copies from host to device of a memory block of 64 KB or less
> • Memory copies performed by functions with the Async suffix
> • Memory set function calls with the Async suffix

in the following reference: http://gpu.di.unimi.it/slides/lezione7.pdf

But there are many other references like this.

Whether a call to cudaMemcpy or cudaMemcpyAsync is blocking or not is described in the CUDA Runtime API documentation: https://docs.nvidia.com/cuda/cuda-runtime-api/api-sync-behavior.html#api-sync-behavior__memcpy-async

For cudaMemcpy, the documentation for CUDA 11.6 says:

  1. For transfers from pageable host memory to device memory, a stream sync is performed before the copy is initiated. The function will return once the pageable buffer has been copied to the staging memory for DMA transfer to device memory, but the DMA to final destination may not have completed.

This could be interpreted as "somewhat asynchronous": the call may return before the transfer has completed, though not immediately, depending on the total transfer size. It all depends on the size of the staging buffer. The slides seem to be old, as they mention the Fermi architecture.
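As a side note (my own illustration, not part of the original answer): the staging buffer can be sidestepped entirely by allocating the host buffer as pinned (page-locked) memory with cudaMallocHost. According to the same API sync-behavior document, cudaMemcpyAsync from pinned host memory is then fully asynchronous with respect to the host. A minimal sketch:

```cuda
#include <cuda_runtime.h>

int main(){
    const size_t bytes = 256ull * 1024ull * 1024ull; // 256 MiB, well above any staging-buffer size

    void* d_buffer;
    cudaMalloc(&d_buffer, bytes);

    // Pinned (page-locked) host allocation: the DMA engine can read it
    // directly, so no staging copy is needed before the transfer starts.
    void* h_buffer;
    cudaMallocHost(&h_buffer, bytes);

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // Returns to the host as soon as the copy is enqueued on the stream.
    cudaMemcpyAsync(d_buffer, h_buffer, bytes, cudaMemcpyHostToDevice, stream);

    // The host is free to do other work here while the DMA runs.

    cudaStreamSynchronize(stream); // wait for the transfer to finish

    cudaFreeHost(h_buffer);
    cudaFree(d_buffer);
    cudaStreamDestroy(stream);
    return 0;
}
```

The trade-off is that pinned memory is a limited resource and cudaMallocHost is slower than malloc, so it pays off mainly for buffers that are transferred repeatedly.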

I tried to find out the current size of the staging buffer on my Linux machine with CUDA 11.5 and driver 495.46, using the following code compiled with nvcc -O3 test1.cu -o test1:

#include <cstdlib>
#include <iostream>

int main(){
    size_t maxsize = 1024ull*1024ull*1024ull; // 1 GiB
    void* d_buffer;
    cudaMalloc(&d_buffer, maxsize);
    void* buffer = malloc(maxsize); // pageable host memory, so the staging path is used

    // Issue host-to-device copies of increasing size from pageable memory.
    for(size_t bytes = 1; bytes <= maxsize; bytes *= 2){
        std::cerr << bytes << "\n";
        cudaMemcpy(d_buffer, buffer, bytes, cudaMemcpyHostToDevice);
    }
    cudaDeviceSynchronize();

    free(buffer);
    cudaFree(d_buffer);
}

Profiling this with the nsight-systems timeline, one can see that up to a 1 MB transfer size, no data is transferred until the API call returns, and the transfer has the expected throughput of 12 GB/s. My interpretation is that the data has been fully copied to the staging buffer before the API returns, which would mean that this buffer has a size of at least 1 MB, not 64 KB.

For transfers larger than 1 MB, say 2 MB, a different approach is used, because the data transfer to the GPU begins before the API call returns. For 2 MB, the time difference between the end of the API call and the end of the transfer is ~86 µs. With a displayed throughput of 6.5 GB/s, this corresponds to ~559 KB (in reality it is probably 512 KB). So, for transfers larger than 1 MB, the staging buffer seems to be split in half to allow pipelining: while one 512 KB chunk is being transferred to the GPU, the next chunk of data can be staged into the remaining 512 KB.
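For readers without a profiler, the same break point can be roughly located by timing the cudaMemcpy call itself against the full copy duration. This is my own sketch, not part of the original measurement; the exact sizes and timings will of course vary by machine:

```cuda
#include <chrono>
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

int main(){
    const size_t maxsize = 64ull * 1024ull * 1024ull; // 64 MiB
    void* d_buffer;
    cudaMalloc(&d_buffer, maxsize);
    void* buffer = malloc(maxsize); // pageable, so the staging path applies

    for(size_t bytes = 1024; bytes <= maxsize; bytes *= 2){
        cudaDeviceSynchronize(); // ensure no previous work is pending

        auto start = std::chrono::steady_clock::now();
        cudaMemcpy(d_buffer, buffer, bytes, cudaMemcpyHostToDevice);
        auto apiEnd = std::chrono::steady_clock::now();

        cudaDeviceSynchronize(); // wait for the DMA to the device to finish
        auto copyEnd = std::chrono::steady_clock::now();

        // If the API-call time roughly equals the total time, the data was
        // fully staged before the call returned (small-transfer path). A gap
        // between the two suggests the pipelined path for larger transfers.
        std::chrono::duration<double, std::micro> api   = apiEnd  - start;
        std::chrono::duration<double, std::micro> total = copyEnd - start;
        printf("%10zu bytes: API call %9.1f us, total %9.1f us\n",
               bytes, api.count(), total.count());
    }

    free(buffer);
    cudaFree(d_buffer);
    return 0;
}
```

Note that wall-clock timing on the host is noisier than a profiler timeline, so each size should ideally be measured several times and averaged.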

