cudaMemcpyAsync() not synchronizing after second kernel call

Question

My goal is to set a host variable that is passed by reference into a CUDA kernel:

// nvcc test_cudaMemcpyAsync.cu -rdc=true
#include <iostream>

__global__ void setHostVar(double& host_var) {
  double const var = 2.0;
  cudaMemcpyAsync(&host_var, &var, sizeof(double), cudaMemcpyDeviceToHost);
  // identifier "cudaMemcpy" is undefined in device code
  // cudaMemcpy(&host_var, &var, sizeof(double), cudaMemcpyDeviceToHost);
}

int main() {
  double host_var = 1.0;

  setHostVar<<<1, 1>>>(host_var);
  cudaDeviceSynchronize();
  std::cout << "host_var = " << host_var << std::endl;

  setHostVar<<<1, 1>>>(host_var);
  cudaDeviceSynchronize();
  std::cout << "host_var = " << host_var << std::endl;

  return 0;
}

Compile and run:

$ nvcc test_cudaMemcpyAsync.cu -rdc=true
$ ./a.out

Output:

host_var = 1
host_var = 1

I can understand the first output line, host_var = 1, given that both the kernel launch and the call to cudaMemcpyAsync() are asynchronous. However, I would have thought that the second kernel is executed only after the prior async calls complete, yet host_var remains unchanged.

Questions

What is incorrect about my expectations?
What is the best/better way to set a host variable passed by reference/pointer into a kernel?

Version

$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2020 NVIDIA Corporation
Built on Wed_Jul_22_19:09:09_PDT_2020
Cuda compilation tools, release 11.0, V11.0.221
Build cuda_11.0_bu.TC445_37.28845127_0

Answer 1

What is incorrect about my expectations?

If we ignore managed memory and host-pinned memory (i.e. if we focus on ordinary host memory, such as the stack variable you are using here), it's a fundamental principle in CUDA that device code cannot access or modify host memory (except on Power9 processor platforms). A direct consequence is that you cannot (with those provisos) pass a reference to a host variable to a CUDA kernel and expect to do anything useful with it.

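If you really want the kernel to write to a variable that the host can read directly, it will be necessary to use either managed memory or host-pinned memory. Both require particular allocators (for example cudaMallocManaged() or cudaHostAlloc()), so in practice you pass a pointer to the allocation rather than a C++ reference.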

In any event, unless you are on a Power9 platform, there is no way to pass a reference to host-based stack memory to a CUDA kernel and use it sensibly.
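For the managed-memory option mentioned above, a minimal sketch might look like the following. The kernel name setManagedVar and the variable names are mine, not from your code, and I'm assuming a GPU and driver that support managed memory. The host should not touch the allocation between the launch and the synchronize.

```cuda
// Sketch only: managed (unified) memory instead of a host stack variable.
// setManagedVar and var are illustrative names, not part of the question.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void setManagedVar(double *var) {
  *var = 2.0;  // device code writes through the managed pointer
}

int main() {
  double *var = nullptr;
  // Managed allocations are reachable from both host and device code.
  if (cudaMallocManaged(&var, sizeof(double)) != cudaSuccess) {
    std::fprintf(stderr, "cudaMallocManaged failed\n");
    return 1;
  }
  *var = 1.0;  // host write before the launch

  setManagedVar<<<1, 1>>>(var);
  // Wait for the kernel before the host reads the allocation again.
  cudaError_t err = cudaDeviceSynchronize();
  if (err != cudaSuccess) {
    std::fprintf(stderr, "kernel failed: %s\n", cudaGetErrorString(err));
    cudaFree(var);
    return 1;
  }

  std::printf("var = %g\n", *var);
  cudaFree(var);
  return 0;
}
```

This still depends on a pointer, as noted above; there is no variant of it that works on a plain double declared on the host stack.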

If you'd like to see sensible usage of memory between host and device, study any of the CUDA sample codes.

What is the best/better way to set a host variable passed by reference/pointer into a kernel?

The closest thing that I would recommend to what you have shown here would look like this (using a host-pinned allocator):

$ cat t14.cu
#include <iostream>

__global__ void setHostVar(double *host_var) {
  double const var = 2.0;
  *host_var = var;
}

int main() {
  double *host_var_ptr;
  cudaHostAlloc(&host_var_ptr, sizeof(double), cudaHostAllocDefault);
  *host_var_ptr = 1.0;

  setHostVar<<<1, 1>>>(host_var_ptr);
  cudaDeviceSynchronize();
  std::cout << "host_var = " << *host_var_ptr << std::endl;

  setHostVar<<<1, 1>>>(host_var_ptr);
  cudaDeviceSynchronize();
  std::cout << "host_var = " << *host_var_ptr << std::endl;

  return 0;
}
$ nvcc -o t14 t14.cu
$ cuda-memcheck ./t14
========= CUDA-MEMCHECK
host_var = 2
host_var = 2
========= ERROR SUMMARY: 0 errors
$

Although that may not adhere exactly to your request.
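If you'd rather keep host_var as an ordinary stack variable, the conventional pattern is to have the kernel write to device memory and then copy the result back from host code. Here is a rough sketch of that, under the assumption that a single double is all you need to return; setDeviceVar and dev_var are names I made up for illustration.

```cuda
// Sketch only: kernel writes to device memory, host copies the result back.
// setDeviceVar and dev_var are illustrative names, not part of the question.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void setDeviceVar(double *dev_var) {
  *dev_var = 2.0;  // write to device memory, never to a host address
}

int main() {
  double host_var = 1.0;  // ordinary host stack variable
  double *dev_var = nullptr;

  cudaError_t err = cudaMalloc(&dev_var, sizeof(double));
  if (err != cudaSuccess) {
    std::fprintf(stderr, "cudaMalloc failed: %s\n", cudaGetErrorString(err));
    return 1;
  }

  setDeviceVar<<<1, 1>>>(dev_var);

  // Host-side cudaMemcpy on the default stream runs after the kernel above
  // and does not return until the copy has finished.
  err = cudaMemcpy(&host_var, dev_var, sizeof(double), cudaMemcpyDeviceToHost);
  if (err != cudaSuccess) {
    std::fprintf(stderr, "cudaMemcpy failed: %s\n", cudaGetErrorString(err));
    cudaFree(dev_var);
    return 1;
  }

  std::printf("host_var = %g\n", host_var);
  cudaFree(dev_var);
  return 0;
}
```

The copy is issued from host code, so no -rdc=true or device-side runtime calls are needed.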

You may also be confused about what asynchronous means in CUDA. Without trying to cover every aspect of the topic: CUDA kernels are launched asynchronously, meaning the CPU thread does not wait for the CUDA kernel to finish before proceeding. However, cudaDeviceSynchronize() forces all work previously issued to that device to complete before the CPU thread is allowed to proceed. That includes the kernel and anything involved with the kernel, such as data copying (however you do it) issued from kernel/device code. So we expect kernel activity to be complete/coherent after such a call.
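Since cudaDeviceSynchronize() also returns an error code, it is worth checking it rather than discarding it as your program does. The following is a sketch of one common way to do that; the CHECK_CUDA macro and the setVar kernel are hypothetical names of my own, and I'm reusing the host-pinned allocation from above so the kernel has something valid to write to.

```cuda
// Sketch only: check the launch and the synchronize for errors.
// CHECK_CUDA and setVar are illustrative names, not CUDA API.
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

#define CHECK_CUDA(call) \
  do { \
    cudaError_t err_ = (call); \
    if (err_ != cudaSuccess) { \
      std::fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__, \
                   cudaGetErrorString(err_)); \
      std::exit(EXIT_FAILURE); \
    } \
  } while (0)

__global__ void setVar(double *var) { *var = 2.0; }

int main() {
  double *var = nullptr;
  CHECK_CUDA(cudaHostAlloc(&var, sizeof(double), cudaHostAllocDefault));
  *var = 1.0;

  setVar<<<1, 1>>>(var);                // returns to the CPU thread right away
  CHECK_CUDA(cudaGetLastError());       // reports launch-time errors
  CHECK_CUDA(cudaDeviceSynchronize());  // waits, reports execution-time errors

  std::printf("var = %g\n", *var);
  CHECK_CUDA(cudaFreeHost(var));
  return 0;
}
```

Running a program under cuda-memcheck, as in the transcript above, is a complementary way to surface the same class of problems.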

solution1
2 ACCEPTED 2020-10-10 15:56:42
