
Questions about CUDA latency hiding mechanism and shared memory

I understand that to make a CUDA program efficient, we need to launch enough threads to hide the latency of expensive operations such as global memory reads. For example, while one thread is waiting on a global memory read, other threads are scheduled to run, so the read overlaps with their execution. Under this model, the overall execution time of a CUDA program would just be the sum of each thread's compute time, not including the time spent on global memory reads. However, if we put the data into shared memory and let the threads read from shared memory instead, we can usually make the program run a lot faster. My confusion is that, since the time for memory reads is hidden, it should not affect the program's performance. Why does it still impact performance so much?

The very short answer is that the mere act of using shared memory won't, by itself, yield a performance improvement.

Reading from global memory into shared memory and then reading from shared memory - which is what the question describes - has no beneficial effect on performance whatsoever. This is a common misconception (mostly the fault of the programming guide, which says shared memory is faster than global memory, leading to the conclusion that using it is a silver bullet).
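As a minimal sketch of that anti-pattern (the kernel names and the 256-thread block size are hypothetical, not from the thread): each element is staged through shared memory but read back exactly once, so the kernel performs exactly the same global loads as the direct version and gains nothing.

```cuda
// Anti-pattern: stage each element through shared memory, read it once.
// The number of global loads is unchanged, so this is no faster than
// the direct kernel below it.
__global__ void scale_staged(const float *in, float *out, float s, int n)
{
    __shared__ float tile[256];                 // assumes blockDim.x == 256
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        tile[threadIdx.x] = in[i];              // one global load per element
    __syncthreads();                            // outside the guard so all threads reach it
    if (i < n)
        out[i] = s * tile[threadIdx.x];         // one shared load -- no reuse
}

// Equivalent direct version: same global traffic, less code, no slower.
__global__ void scale_direct(const float *in, float *out, float s, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = s * in[i];
}
```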

Shared memory can only ever improve performance in a few ways: by facilitating coalesced reads from or writes to global memory (reducing the number of memory transactions and improving cache utilization), by enabling data sharing or reuse between threads (saving memory bandwidth), or by serving as a faster scratch space than thread-local memory, which is stored in DRAM.
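For contrast, here is a minimal sketch of the reuse case (the kernel name, tile size, and the assumption that n is a multiple of the block size are mine, for brevity): in a 3-point stencil each input element is needed by up to three threads, so loading it into shared memory once replaces up to three global loads.

```cuda
#define TILE 256

// Hypothetical 3-point stencil: out[i] = in[i-1] + in[i] + in[i+1].
// Assumes n is a multiple of blockDim.x (== TILE) for brevity.
// Each block loads its tile plus a one-element halo on each side into
// shared memory; each element is then read up to three times from
// shared memory instead of up to three times from global memory.
__global__ void stencil3(const float *in, float *out, int n)
{
    __shared__ float tile[TILE + 2];            // +2 for the halo
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int t = threadIdx.x + 1;                    // shifted index into the tile

    tile[t] = in[i];                            // one global load per element
    if (threadIdx.x == 0)
        tile[0] = (i > 0) ? in[i - 1] : 0.0f;   // left halo (zero at the edge)
    if (threadIdx.x == blockDim.x - 1)
        tile[t + 1] = (i + 1 < n) ? in[i + 1] : 0.0f;  // right halo
    __syncthreads();

    // Each input element is consumed by up to three threads here,
    // but was fetched from global memory only once.
    out[i] = tile[t - 1] + tile[t] + tile[t + 1];
}
```

Launched as, say, stencil3<<<n / TILE, TILE>>>(in, out, n), this cuts global read traffic by roughly a factor of three compared with each thread reading its three inputs directly from global memory.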

[This answer was assembled from comments and added as a community wiki entry to get the question off the unanswered list.]

If you issue too many requests to global memory, eventually all threads will mostly be waiting for data to arrive from it (or for writes to it to complete), so there is no independent work left to hide its latency, and the lack of bandwidth becomes the bottleneck.

Shared memory helps decrease the amount of reading from and writing to global memory, which is especially useful in cases like the one above (and such cases are typical rather than occasional).
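As a rough, hypothetical accounting (ignoring hardware caches): a 3-point stencil like the sketch above would issue 3n global loads for n outputs if every thread read its three inputs directly, but only about n loads (plus two halo elements per block) when the inputs are staged through shared memory - roughly a 3x reduction in global traffic. When the kernel is bandwidth-bound, that reduction translates almost directly into runtime.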

the overall execution time for a CUDA program is just the sum of each thread's execution time

No! It is the time between the first thread's start and the last thread's finish. Threads execute concurrently, so their individual execution times overlap rather than add up.
