MATLAB GPU - Latency of CUDA memory copies?

Question

I am trying to measure the latencies of CUDA memory copies in MATLAB. I wrote the following routine, where a scalar is repeatedly copied to and from the GPU.

a = single(randn(1,1));   % single-precision scalar on the host

tic;
for j = 1:50*1000
    aGpu = gpuArray(a);   % copy host -> device
    a2 = gather(aGpu);    % copy device -> host
end
toc;

The execution time is approximately one second. With 50,000 iterations in the loop, that is about 20 µs per iteration, and since my CPU runs at 3.4/3.7 GHz, copying a scalar back and forth takes approximately 70,000 CPU cycles on average. I am only copying a scalar, so I assume the time to transfer the data itself is negligible and most of the time is latency. This latency seems excessively high to me. I have read in various places that the latency of a CUDA memory copy should be below 1,000 CPU cycles. Has anybody done similar experiments? Are my numbers strange? Is it a problem with MATLAB? Is there anything in the system/GPU configuration that needs to be set up to reduce latencies?
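The cycle estimate above can be reproduced with a few lines of arithmetic. This is only a sketch using the approximate figures quoted in the question (one second total, 50,000 iterations, 3.4 GHz clock); the variable names are made up for illustration.

```matlab
% Back-of-the-envelope check of the cycle estimate (figures from the question).
elapsedSeconds = 1.0;        % approximate total time reported by toc
numIterations  = 50*1000;    % loop count
clockHz        = 3.4e9;      % lower of the two quoted clock rates (3.4/3.7 GHz)

% Each iteration contains one host-to-device and one device-to-host copy.
secondsPerIteration = elapsedSeconds / numIterations;
cyclesPerIteration  = secondsPerIteration * clockHz;

fprintf('%.1f us per iteration, about %.0f CPU cycles\n', ...
        secondsPerIteration*1e6, cyclesPerIteration);
```

At 3.4 GHz this comes to roughly 68,000 cycles per iteration, which is where the figure of about 70,000 comes from.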

More details: I am working with Windows 7 and MATLAB R2014a, on an Intel i7 CPU and a GeForce GTX 770 GPU.

Answer 1

Your loop measures two memory copies per iteration, one to the GPU and one back. One second over 50,000 iterations is about 20 µs per iteration, so each copy takes roughly 10 µs, which I actually think is not bad at all (remember that a memory copy has essentially the same overhead as a kernel launch). For example, the following two papers estimate a latency of about 10 microseconds: 1) "Reducing GPU Offload Latency via Fine-Grained CPU-GPU Synchronization"; 2) "Latency and Bandwidth Impact on GPU-systems".
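To see how the time splits between the two directions, each copy can be timed in its own loop. The following is a minimal sketch, assuming the Parallel Computing Toolbox and a supported CUDA GPU; the variable names (uploadTime, downloadTime) are illustrative, and the numbers include MATLAB's own per-call overhead, not just the underlying CUDA copy.

```matlab
% Sketch: time host-to-device and device-to-host copies separately.
a = single(randn(1,1));
numIterations = 50*1000;
gpu = gpuDevice;             % select the current GPU before timing starts

aGpu = gpuArray(a);          % warm-up copy so one-time setup is not timed
wait(gpu);

tic;
for j = 1:numIterations
    aGpu = gpuArray(a);      % host -> device
end
wait(gpu);                   % block until pending GPU work has finished
uploadTime = toc / numIterations;

tic;
for j = 1:numIterations
    a2 = gather(aGpu);       % device -> host
end
downloadTime = toc / numIterations;

fprintf('upload: %.1f us, download: %.1f us per copy\n', ...
        uploadTime*1e6, downloadTime*1e6);
```

If both figures land near 10 µs, that is consistent with the per-copy estimate above; large differences between the two would suggest that one direction carries extra synchronization cost.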
