
Why are overlapping data transfers in CUDA slower than expected?

When I run the simpleMultiCopy sample from the CUDA SDK (4.0) on a Tesla C2050, I get the following results:

[simpleMultiCopy] starting...
[Tesla C2050] has 14 MP(s) x 32 (Cores/MP) = 448 (Cores)
> Device name: Tesla C2050
> CUDA Capability 2.0 hardware with 14 multi-processors
> scale_factor = 1.00
> array_size   = 4194304


Relevant properties of this CUDA device
(X) Can overlap one CPU<>GPU data transfer with GPU kernel execution (device property "deviceOverlap")
(X) Can overlap two CPU<>GPU data transfers with GPU kernel execution
    (compute capability >= 2.0 AND (Tesla product OR Quadro 4000/5000)

Measured timings (throughput):
 Memcpy host to device  : 2.725792 ms (6.154988 GB/s)
 Memcpy device to host  : 2.723360 ms (6.160484 GB/s)
 Kernel         : 0.611264 ms (274.467599 GB/s)

Theoretical limits for speedup gained from overlapped data transfers:
No overlap at all (transfer-kernel-transfer): 6.060416 ms 
Compute can overlap with one transfer: 5.449152 ms
Compute can overlap with both data transfers: 2.725792 ms

Average measured timings over 10 repetitions:
 Avg. time when execution fully serialized  : 6.113555 ms
 Avg. time when overlapped using 4 streams  : 4.308822 ms
 Avg. speedup gained (serialized - overlapped)  : 1.804733 ms

Measured throughput:
 Fully serialized execution     : 5.488530 GB/s
 Overlapped using 4 streams     : 7.787379 GB/s
[simpleMultiCopy] test results...
PASSED

This shows that the expected runtime is 2.7 ms, while it actually takes 4.3 ms. What exactly causes this discrepancy? (I've also posted this question at http://forums.developer.nvidia.com/devforum/discussion/comment/8976 .)
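
(For what it's worth, those three limits appear to follow directly from the measured timings: no overlap is t_H2D + t_kernel + t_D2H ≈ 2.726 + 0.611 + 2.723 ≈ 6.060 ms; overlapping compute with one transfer hides the kernel behind a copy, leaving t_H2D + t_D2H ≈ 5.449 ms; overlapping with both transfers hides the kernel and one copy entirely, leaving max(t_H2D, t_D2H) ≈ 2.726 ms.)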

The first kernel launch cannot start until the first memcpy has completed, and the last memcpy cannot start until the last kernel has completed. So there is "overhang" that introduces some of the overhead you are observing. You can decrease the size of the "overhang" by increasing the number of streams, but the streams' inter-engine synchronization incurs its own overhead.
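
To make the overhang concrete, here is a minimal sketch of the kind of four-stream pipeline the sample times. This is my own sketch, not the actual simpleMultiCopy source; incKernel and the buffer names are placeholders, and error checking is omitted for brevity:

    #include <cuda_runtime.h>

    #define NSTREAMS 4

    // Stand-in kernel: the real sample does more work per element.
    __global__ void incKernel(const int *in, int *out, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) out[i] = in[i] + 1;
    }

    int main(void)
    {
        const int    n     = 4 * 1024 * 1024;  // array_size from the log above
        const int    chunk = n / NSTREAMS;     // elements handled per stream
        const size_t cb    = chunk * sizeof(int);

        int *h_in, *h_out, *d_in, *d_out;
        // Pinned host memory is required for cudaMemcpyAsync to actually overlap.
        cudaMallocHost((void **)&h_in,  n * sizeof(int));
        cudaMallocHost((void **)&h_out, n * sizeof(int));
        cudaMalloc((void **)&d_in,  n * sizeof(int));
        cudaMalloc((void **)&d_out, n * sizeof(int));

        cudaStream_t stream[NSTREAMS];
        for (int i = 0; i < NSTREAMS; ++i)
            cudaStreamCreate(&stream[i]);

        // Breadth-first issue order: all H2D copies, then all kernels,
        // then all D2H copies, so each engine's queue stays full.
        for (int i = 0; i < NSTREAMS; ++i)
            cudaMemcpyAsync(d_in + i * chunk, h_in + i * chunk, cb,
                            cudaMemcpyHostToDevice, stream[i]);
        for (int i = 0; i < NSTREAMS; ++i)
            incKernel<<<(chunk + 255) / 256, 256, 0, stream[i]>>>(
                d_in + i * chunk, d_out + i * chunk, chunk);
        for (int i = 0; i < NSTREAMS; ++i)
            cudaMemcpyAsync(h_out + i * chunk, d_out + i * chunk, cb,
                            cudaMemcpyDeviceToHost, stream[i]);
        cudaDeviceSynchronize();

        // The kernel in stream[0] still has to wait for its H2D chunk, and
        // the D2H copy in stream[NSTREAMS-1] has to wait for its kernel:
        // that fill/drain time is the "overhang".
        for (int i = 0; i < NSTREAMS; ++i)
            cudaStreamDestroy(stream[i]);
        cudaFreeHost(h_in); cudaFreeHost(h_out);
        cudaFree(d_in);     cudaFree(d_out);
        return 0;
    }

Note the breadth-first issue order: on Fermi-class hardware like the C2050, issuing each stream's copy-kernel-copy sequence depth-first can falsely serialize the engines' queues, so batching the operations by type is the recommended pattern.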

It's important to note that overlapping compute with transfer doesn't always benefit a given workload: in addition to the overhead issues described above, the workload only approaches the maximum benefit when it spends comparable amounts of time on compute and on data transfer. Per Amdahl's Law, the potential speedup of 2x or 3x falls off as the workload becomes either transfer-bound or compute-bound.
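
As a rough back-of-the-envelope model (my own estimate, not output from the sample): with n streams, the H2D engine is the bottleneck here, and the first kernel plus the last D2H chunk add fill/drain time, so a perfectly pipelined run cannot finish before about

    t_H2D + (t_kernel + t_D2H) / n ≈ 2.73 + (0.61 + 2.72) / 4 ≈ 3.56 ms

for 4 streams. That bounds the speedup at roughly 6.11 / 3.56 ≈ 1.7x rather than the headline 2.2x, and of the 4.31 ms actually measured, the remaining ~0.75 ms is plausibly the inter-engine synchronization overhead mentioned above.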

