

Why are overlapping data transfers in CUDA slower than expected?

When I run simpleMultiCopy from the SDK (4.0) on a Tesla C2050, I get the following results:

[simpleMultiCopy] starting...
[Tesla C2050] has 14 MP(s) x 32 (Cores/MP) = 448 (Cores)
> Device name: Tesla C2050
> CUDA Capability 2.0 hardware with 14 multi-processors
> scale_factor = 1.00
> array_size   = 4194304


Relevant properties of this CUDA device
(X) Can overlap one CPU<>GPU data transfer with GPU kernel execution (device property "deviceOverlap")
(X) Can overlap two CPU<>GPU data transfers with GPU kernel execution
    (compute capability >= 2.0 AND (Tesla product OR Quadro 4000/5000)

Measured timings (throughput):
 Memcpy host to device  : 2.725792 ms (6.154988 GB/s)
 Memcpy device to host  : 2.723360 ms (6.160484 GB/s)
 Kernel         : 0.611264 ms (274.467599 GB/s)

Theoretical limits for speedup gained from overlapped data transfers:
No overlap at all (transfer-kernel-transfer): 6.060416 ms 
Compute can overlap with one transfer: 5.449152 ms
Compute can overlap with both data transfers: 2.725792 ms

Average measured timings over 10 repetitions:
 Avg. time when execution fully serialized  : 6.113555 ms
 Avg. time when overlapped using 4 streams  : 4.308822 ms
 Avg. speedup gained (serialized - overlapped)  : 1.804733 ms

Measured throughput:
 Fully serialized execution     : 5.488530 GB/s
 Overlapped using 4 streams     : 7.787379 GB/s
[simpleMultiCopy] test results...
PASSED

This shows an expected run time of 2.7 ms, yet the run actually takes 4.3 ms. What exactly causes this discrepancy? (I also posted this question at http://forums.developer.nvidia.com/devforum/discussion/comment/8976.)
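For reference, the pattern the benchmark times is essentially the following breadth-first multi-stream loop (a minimal sketch, not the actual SDK source; the increment kernel, sizes, and names below are illustrative):

#include <cstdio>
#include <cuda_runtime.h>

#define NSTREAMS 4
#define CHUNK (4194304 / NSTREAMS)   // elements per stream, matching array_size above

// Illustrative stand-in for the sample's kernel.
__global__ void incKernel(int *d, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] += 1;
}

int main() {
    cudaStream_t streams[NSTREAMS];
    int *h_data, *d_data;

    // Pinned host memory is required for cudaMemcpyAsync to overlap at all.
    cudaMallocHost((void **)&h_data, NSTREAMS * CHUNK * sizeof(int));
    cudaMalloc((void **)&d_data, NSTREAMS * CHUNK * sizeof(int));
    for (int s = 0; s < NSTREAMS; ++s) cudaStreamCreate(&streams[s]);

    // Each stream gets its own H2D copy, kernel, and D2H copy, so copies
    // in one stream can overlap the kernel running in another.
    for (int s = 0; s < NSTREAMS; ++s) {
        int off = s * CHUNK;
        cudaMemcpyAsync(d_data + off, h_data + off, CHUNK * sizeof(int),
                        cudaMemcpyHostToDevice, streams[s]);
        incKernel<<<(CHUNK + 255) / 256, 256, 0, streams[s]>>>(d_data + off, CHUNK);
        cudaMemcpyAsync(h_data + off, d_data + off, CHUNK * sizeof(int),
                        cudaMemcpyDeviceToHost, streams[s]);
    }
    cudaDeviceSynchronize();

    for (int s = 0; s < NSTREAMS; ++s) cudaStreamDestroy(streams[s]);
    cudaFreeHost(h_data);
    cudaFree(d_data);
    return 0;
}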

The first kernel launch cannot begin until the first memcpy has completed, and the last memcpy cannot begin until the last kernel has finished. So there is an "overhang" that introduces some of the overhead you are observing. You can shrink the overhang by increasing the number of streams, but the inter-engine synchronization of the streams carries its own overhead.
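A rough estimate with the measured numbers above makes the overhang concrete (an approximation that ignores the synchronization costs): with $N$ streams and two copy engines, the host-to-device engine is busy for the whole transfer time, after which the last chunk's kernel and device-to-host copy still have to drain:

$$T_{\text{overlap}} \approx T_{\text{H2D}} + \frac{T_{\text{K}} + T_{\text{D2H}}}{N} \approx 2.73 + \frac{0.61 + 2.72}{4} \approx 3.56\ \text{ms}$$

So even with perfect overlap, 4 streams cannot reach the 2.73 ms theoretical limit, and the remaining gap to the measured 4.3 ms is the inter-engine synchronization overhead.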

It is important to note that overlapping compute and transfer does not always benefit a given workload. Beyond the overhead issues described above, the workload itself must split its time roughly evenly between computation and data transfer for overlap to pay off. Per Amdahl's Law, the potential 2x or 3x speedup falls away as the workload becomes transfer-bound or compute-bound.
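Concretely, the best speedup overlap can ever deliver is the serialized time divided by the longest single stage; plugging in the measurements above:

$$S_{\max} = \frac{T_{\text{H2D}} + T_{\text{K}} + T_{\text{D2H}}}{\max\left(T_{\text{H2D}},\, T_{\text{K}},\, T_{\text{D2H}}\right)} \approx \frac{6.06}{2.73} \approx 2.2\times$$

The 3x figure is reachable only when all three stages take equal time; here the kernel is far shorter than either transfer, so the workload is transfer-bound and the ceiling is well below 3x.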
