I am trying to make a cudaMemcpyHostToDevice copy wait for a specific event by using cudaStreamAddCallback. The documentation for the cudaStreamAddCallback API says:
The callback will block later work in the stream until it is finished.
So I expected later work in the stream, such as cudaMemcpyAsync, to be blocked until the callback returns. However, the assertion at the end of the following code fails.
#include <cuda_runtime.h>
#include <stdlib.h>
#include <string.h>
#include <cassert>
#include <unistd.h>
#include <stdio.h>

#define cuda_check(x) \
    assert((x) == cudaSuccess)

const size_t size = 1024 * 1024;

static void CUDART_CB cuda_callback(
    cudaStream_t, cudaError_t, void* host) {
    float* host_A = static_cast<float*>(host);
    for (size_t i = 0; i < size; ++i) {
        host_A[i] = i;
    }
    printf("hello\n");
    sleep(1);
}

int main(void) {
    float* A;
    cuda_check(cudaMalloc(&A, size * 4));
    float* host_A = static_cast<float*>(malloc(size * 4));
    float* result = static_cast<float*>(malloc(size * 4));
    memset(host_A, 0, size * 4);
    cuda_check(cudaMemcpy(A, host_A, size * 4, cudaMemcpyHostToDevice));
    cudaStream_t stream;
    cuda_check(cudaStreamCreate(&stream));
    cuda_check(cudaStreamAddCallback(stream, cuda_callback, host_A, 0));
    cuda_check(cudaMemcpyAsync(A, host_A, size * 4, cudaMemcpyHostToDevice,
                               stream));
    cuda_check(cudaStreamSynchronize(stream));
    cuda_check(cudaMemcpy(result, A, size * 4, cudaMemcpyDeviceToHost));
    for (size_t i = 0; i < size; ++i) {
        assert(result[i] == i);
    }
    return 0;
}
Your assumption about what is happening isn't really correct. If I use the profiler to collect the runtime API trace for your code (I added the cudaDeviceReset call to ensure the profiling data is flushed), I see this (the columns are start time, duration, and API call):
124.79ms 129.57ms cudaMalloc
255.23ms 694.20us cudaMemcpy
255.93ms 38.881us cudaStreamCreate
255.97ms 123.44us cudaStreamAddCallback
256.09ms 1.00348s cudaMemcpyAsync
1.25957s 76.899us cudaStreamSynchronize
1.25965s 1.3067ms cudaMemcpy
1.26187s 71.884ms cudaDeviceReset
As you can see, the cudaMemcpyAsync call did get blocked by the callback (it took more than 1.0 second to finish).
The copy most likely failed to happen in the order you expected because you are using a regular pageable host allocation rather than pinned memory, and you are expecting the callback to fire instantly on the empty queue. Note that registering the stream callback and starting the copy occur less than 0.1 milliseconds apart. The callback runs in another thread, so it might not fire immediately, which leaves open the possibility that the copy starts before the callback function reacts to the empty-queue condition.
Interestingly, if I change host_A to a pinned allocation and run the code, I get this API timeline:
124.21ms 130.24ms cudaMalloc
254.45ms 1.0988ms cudaHostAlloc
255.98ms 376.14us cudaMemcpy
256.36ms 33.841us cudaStreamCreate
256.39ms 87.303us cudaStreamAddCallback
256.48ms 17.208us cudaMemcpyAsync
256.50ms 1.00331s cudaStreamSynchronize
1.25981s 1.2880ms cudaMemcpy
1.26205s 68.506ms cudaDeviceReset
Note that cudaStreamSynchronize is now the call that is blocked. In this case the program passes the assert, which is probably because the scheduler correctly manages dependencies in the stream when the host memory is pinned.