cudaStreamAddCallback doesn't block later cudaMemcpyAsync

Question

I am trying to make a host-to-device cudaMemcpyAsync wait for a specific event by using cudaStreamAddCallback. The comments for the cudaStreamAddCallback API say:

The callback will block later work in the stream until it is finished.

So I expected later work in the stream, such as cudaMemcpyAsync, to be blocked until the callback returns. But the assertion at the end of the following code fails.

#include <cuda_runtime.h>

#include <stdlib.h>
#include <string.h>
#include <cassert>
#include <unistd.h>
#include <stdio.h>

#define cuda_check(x) \
    assert((x) == cudaSuccess)

const size_t size = 1024 * 1024;


static void CUDART_CB cuda_callback(
        cudaStream_t, cudaError_t, void* host) {
    float* host_A = static_cast<float*>(host);
    for (size_t i = 0; i < size; ++i) {
        host_A[i] = i;
    }

    printf("hello\n");
    sleep(1);
}

int main(void) {

    float* A;
    cuda_check(cudaMalloc(&A, size * 4));
    float* host_A = static_cast<float*>(malloc(size * 4));
    float* result = static_cast<float*>(malloc(size * 4));

    memset(host_A, 0, size * 4);

    cuda_check(cudaMemcpy(A, host_A, size * 4, cudaMemcpyHostToDevice));

    cudaStream_t stream;
    cuda_check(cudaStreamCreate(&stream));

    cuda_check(cudaStreamAddCallback(stream, cuda_callback, host_A, 0));
    cuda_check(cudaMemcpyAsync(A, host_A, size * 4, cudaMemcpyHostToDevice,
                               stream));

    cuda_check(cudaStreamSynchronize(stream));
    cuda_check(cudaMemcpy(result, A, size * 4, cudaMemcpyDeviceToHost));

    for (size_t i = 0; i < size; ++i) {
        assert(result[i] == i);
    }

    return 0;
}

Answer 1

Your assumption about what is happening isn't really correct. If I use the profiler to collect the runtime API trace for your code (the cudaDeviceReset was added by me to ensure profiling data is flushed), I see this:

124.79ms  129.57ms  cudaMalloc
255.23ms  694.20us  cudaMemcpy
255.93ms  38.881us  cudaStreamCreate
255.97ms  123.44us  cudaStreamAddCallback
256.09ms  1.00348s  cudaMemcpyAsync
1.25957s  76.899us  cudaStreamSynchronize
1.25965s  1.3067ms  cudaMemcpy
1.26187s  71.884ms  cudaDeviceReset

As you can see, the cudaMemcpyAsync call did get blocked by the callback (it took just over 1.0 second to return).

The fact that the copy did not happen in the sequence you expected is likely because you are using a regular pageable host allocation rather than pinned memory, and because you are expecting the callback to fire on the empty queue instantly. Note that registering the stream callback and starting the copy occur only about 0.12 milliseconds apart. The callback runs in another thread, so it might not fire immediately, which leaves open the possibility that the copy starts before the callback function reacts to the empty queue condition.

Interestingly, if I change host_A to a pinned allocation and run the code I get this API timeline:

124.21ms  130.24ms  cudaMalloc
254.45ms  1.0988ms  cudaHostAlloc
255.98ms  376.14us  cudaMemcpy
256.36ms  33.841us  cudaStreamCreate
256.39ms  87.303us  cudaStreamAddCallback
256.48ms  17.208us  cudaMemcpyAsync
256.50ms  1.00331s  cudaStreamSynchronize
1.25981s  1.2880ms  cudaMemcpy
1.26205s  68.506ms  cudaDeviceReset

Note that cudaStreamSynchronize is now the call that blocks, while cudaMemcpyAsync returns almost immediately. In this case the program passes the assert, which is probably because the scheduler can correctly manage dependencies in the stream when the host memory is pinned.
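For reference, here is a minimal sketch of the pinned-memory variant described above. It is not the exact code I profiled: the helper name pinned_copy_sees_callback_data, the CUDA_CHECK macro, and the constants kCount and kBytes are my own illustrative choices, and the sleep and printf from the question's callback are left out. The only substantive change from the question is that host_A comes from cudaHostAlloc instead of malloc.

```cuda
#include <cuda_runtime.h>

#include <cstdio>
#include <cstdlib>
#include <cstring>

// Illustrative error check (not from the question). Unlike assert(),
// the wrapped call is still executed when NDEBUG is defined.
#define CUDA_CHECK(call)                                      \
    do {                                                      \
        cudaError_t err_ = (call);                            \
        if (err_ != cudaSuccess) {                            \
            fprintf(stderr, "%s failed: %s\n", #call,         \
                    cudaGetErrorString(err_));                \
            exit(EXIT_FAILURE);                               \
        }                                                     \
    } while (0)

static const size_t kCount = 1024 * 1024;
static const size_t kBytes = kCount * sizeof(float);

// Host callback: fills the pinned buffer. It makes no CUDA API calls.
static void CUDART_CB fill_callback(cudaStream_t, cudaError_t, void* user) {
    float* host = static_cast<float*>(user);
    for (size_t i = 0; i < kCount; ++i) {
        host[i] = static_cast<float>(i);
    }
}

// Returns true if the device copy contains the data written by the callback.
bool pinned_copy_sees_callback_data() {
    float* dev = nullptr;
    float* host_A = nullptr;
    float* result = static_cast<float*>(malloc(kBytes));
    if (result == nullptr) return false;

    CUDA_CHECK(cudaMalloc(&dev, kBytes));
    // Pinned (page-locked) host allocation instead of malloc.
    CUDA_CHECK(cudaHostAlloc(reinterpret_cast<void**>(&host_A), kBytes,
                             cudaHostAllocDefault));
    memset(host_A, 0, kBytes);

    cudaStream_t stream;
    CUDA_CHECK(cudaStreamCreate(&stream));

    // The callback is enqueued first, then the copy, both in the same stream.
    CUDA_CHECK(cudaStreamAddCallback(stream, fill_callback, host_A, 0));
    CUDA_CHECK(cudaMemcpyAsync(dev, host_A, kBytes, cudaMemcpyHostToDevice,
                               stream));
    CUDA_CHECK(cudaStreamSynchronize(stream));

    CUDA_CHECK(cudaMemcpy(result, dev, kBytes, cudaMemcpyDeviceToHost));

    bool ok = true;
    for (size_t i = 0; i < kCount; ++i) {
        if (result[i] != static_cast<float>(i)) { ok = false; break; }
    }

    CUDA_CHECK(cudaStreamDestroy(stream));
    CUDA_CHECK(cudaFreeHost(host_A));
    CUDA_CHECK(cudaFree(dev));
    free(result);
    return ok;
}

int main() {
    bool ok = pinned_copy_sees_callback_data();
    printf("%s\n", ok ? "copy saw callback data" : "copy saw stale data");
    return ok ? 0 : 1;
}
```

Build it with nvcc and run it on a machine with a CUDA-capable GPU. With the pinned buffer, the copy should be ordered after the callback in the stream, so the check is expected to pass, which matches what I saw in the second timeline.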
