
Weird cudaMemcpyAsync synchronization behavior

The following code tests the synchronization behavior of cudaMemcpyAsync.

#include <cstdio>
#include <iostream>
#include <sys/time.h>

#define N 100000000

using namespace std;


// returns (t1 - t2) in milliseconds
int diff_ms(struct timeval t1, struct timeval t2)
{
    return (((t1.tv_sec - t2.tv_sec) * 1000000) +
            (t1.tv_usec - t2.tv_usec))/1000;
}

double sumall(double *v, int n)
{
    double s=0;
    for (int i=0; i<n; i++) s+=v[i];
    return s;
}


int main()
{
    int i;

    cudaStream_t strm;
    cudaStreamCreate(&strm);

    double *h0;
    double *h1;
    // pinned host allocations: cudaMemcpyAsync can only be truly
    // asynchronous with page-locked (pinned) host memory
    cudaMallocHost(&h0, N*sizeof(double));
    cudaMallocHost(&h1, N*sizeof(double));

    for (i=0; i<N; i++) h0[i]=99./N;
    double *d; 
    cudaMalloc(&d,N*sizeof(double));

    struct timeval t1, t2;
    gettimeofday(&t1, NULL);
    cudaMemcpyAsync(d, h0, N*sizeof(double), cudaMemcpyHostToDevice, strm);
    gettimeofday(&t2, NULL);
    printf("cuda H->D %d takes: %d ms\n", i, diff_ms(t2, t1));
    gettimeofday(&t1, NULL);
    cudaMemcpyAsync(h1, d, N*sizeof(double), cudaMemcpyDeviceToHost, strm);
    gettimeofday(&t2, NULL);
    printf("cuda D->H %d takes: %d ms\n", i, diff_ms(t2, t1));

    cout<<"sum h0: "<<sumall(h0,N)<<endl;
    cout<<"sum h1: "<<sumall(h1,N)<<endl;


    cudaStreamDestroy(strm);
    cudaFree(d);
    cudaFreeHost(h0);
    cudaFreeHost(h1);

    return 0;
}

The printout of h0/h1 suggests that cudaMemcpyAsync is synchronized with the host:

sum h0: 99
sum h1: 99

However, the time measured around the cudaMemcpyAsync calls suggests that they are not synchronized with the host:

cuda H->D 100000000 takes: 0 ms
cuda D->H 100000000 takes: 0 ms

because 0 ms cannot be the actual copy time, as the CUDA profiling results show:

method=[ memcpyHtoDasync ] gputime=[ 154896.734 ] cputime=[ 17.000 ] 
method=[ memcpyDtoHasync ] gputime=[ 141175.578 ] cputime=[ 6.000 ] 

Not sure why...

There are (at least) two things going on here.

Your first observation is that:

sum h0: 99
sum h1: 99

CUDA calls issued to the same stream are executed in sequence. If you want CUDA calls to overlap with each other, they must be issued to separate streams. Since you issue the copy to the device and the copy from the device into the same stream, they execute in sequence: the second one does not begin until the first one completes (even though both are queued up immediately). Therefore the data is intact after the first cudaMemcpyAsync, and you observe that both arrays produce the proper sum.
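For illustration, here is a minimal sketch (my addition, not from your code) of what issuing the copies to separate streams would look like, reusing d, h0, h1, and N from your program; the stream names s0/s1 are mine. Note that in your particular program the two copies share the buffer d, so letting them overlap would be incorrect: separate streams are only appropriate for independent operations, and overlap of an H->D with a D->H copy additionally requires a device with two copy engines and pinned host memory.

cudaStream_t s0, s1;              // hypothetical stream names
cudaStreamCreate(&s0);
cudaStreamCreate(&s1);

// Issued to different streams, these two copies are independent:
// neither is ordered after the other.
cudaMemcpyAsync(d,  h0, N*sizeof(double), cudaMemcpyHostToDevice, s0);
cudaMemcpyAsync(h1, d,  N*sizeof(double), cudaMemcpyDeviceToHost, s1);

// Wait for both streams before touching h1 on the host.
cudaStreamSynchronize(s0);
cudaStreamSynchronize(s1);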

Your remaining observations are also consistent with each other. You are reporting that:

cuda H->D 100000000 takes: 0 ms
cuda D->H 100000000 takes: 0 ms

This is because both of these async calls return control to the host thread immediately: each operation is queued up to execute asynchronously with respect to the host, and then proceeds in parallel with further host execution. Since control returns to the host immediately, and you are timing the operations with host-based timing methods, they appear to take zero time.
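To see non-zero numbers with your host-based timers, you would have to block the host until the queued copy has actually finished, for example with cudaStreamSynchronize. A sketch (my addition), reusing the variables from your code:

gettimeofday(&t1, NULL);
cudaMemcpyAsync(d, h0, N*sizeof(double), cudaMemcpyHostToDevice, strm);
cudaStreamSynchronize(strm);   // block the host until the copy completes
gettimeofday(&t2, NULL);
// now reports roughly the profiler's gputime instead of ~0 ms
printf("cuda H->D takes: %d ms\n", diff_ms(t2, t1));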

Of course they do not actually take zero time, and your profiler results indicate that. Since the GPU executes asynchronously to the CPU (and that includes cudaMemcpyAsync here), the profiler shows the actual "real" duration of each copy, reported as gputime, as well as its "apparent" cost on the CPU, i.e. the time the CPU needed to launch the operation, reported as cputime. Note that cputime is tiny compared to gputime: the launch is nearly instantaneous, which is why your host-based timing reports zero. But the operations do not actually complete in zero time, and the profiler confirms that.

If you used cudaEvent-based timing instead, you would of course see different results, much closer to your profiler's gputime.
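A sketch of what that might look like (my addition), again reusing strm, d, h0, and N from your code. Events are recorded into the stream, so the measured interval brackets the copy itself rather than just the launch:

cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start, strm);                 // enqueued before the copy
cudaMemcpyAsync(d, h0, N*sizeof(double), cudaMemcpyHostToDevice, strm);
cudaEventRecord(stop, strm);                  // enqueued after the copy
cudaEventSynchronize(stop);                   // wait for the stop event

float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);       // milliseconds between events
printf("cuda H->D (event timing): %.3f ms\n", ms);

cudaEventDestroy(start);
cudaEventDestroy(stop);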
