
Questions about CUDA latency hiding mechanism and shared memory

I understand that to make a CUDA program efficient, we need to launch enough threads to hide the latency of expensive operations such as global memory reads. For example, while one thread is waiting on a global memory read, other threads are scheduled to run, so the read overlaps with their execution. Under this model, the overall execution time of a CUDA program would just be the sum of each thread's compute time, not including the time spent on global memory reads. However, if we put the data into shared memory and let the threads read from shared memory instead, we can usually make the program run a lot faster. My confusion is that, since the time for memory reads is hidden, it should not affect the program's performance. Why does it still impact performance so much?

The very short answer is that the mere act of using shared memory won't, by itself, yield a performance improvement.

Reading from global memory into shared memory and then reading from shared memory - which is what the question describes - has no beneficial effect on performance whatsoever. This is a common misconception (mostly the fault of the programming guide, which says shared memory is faster than global memory, leading to the conclusion that using it is a silver bullet).
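As a minimal sketch of that anti-pattern (the kernel names and the 256-thread block size are hypothetical, not from the thread): each element is staged through shared memory but read back exactly once, so the kernel performs exactly the same global loads as the direct version and gains nothing.

```cuda
// Anti-pattern: stage each element through shared memory, read it once.
// The number of global loads is unchanged, so this is no faster than
// the direct kernel below it.
__global__ void scale_staged(const float *in, float *out, float s, int n)
{
    __shared__ float tile[256];                 // assumes blockDim.x == 256
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        tile[threadIdx.x] = in[i];              // one global load per element
    __syncthreads();                            // outside the guard so all threads reach it
    if (i < n)
        out[i] = s * tile[threadIdx.x];         // one shared load -- no reuse
}

// Equivalent direct version: same global traffic, less code, no slower.
__global__ void scale_direct(const float *in, float *out, float s, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = s * in[i];
}
```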

Shared memory can only ever improve performance in a few ways: by facilitating coalesced reads from or writes to global memory (reducing the number of memory transactions and improving cache utilization), by enabling data sharing or reuse between threads (saving memory bandwidth), or by serving as a faster scratch space than thread-local memory, which is stored in DRAM.
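For contrast, here is a minimal sketch of the reuse case (the kernel name, tile size, and the assumption that n is a multiple of the block size are mine, for brevity): in a 3-point stencil each input element is needed by up to three threads, so loading it into shared memory once replaces up to three global loads.

```cuda
#define TILE 256

// Hypothetical 3-point stencil: out[i] = in[i-1] + in[i] + in[i+1].
// Assumes n is a multiple of blockDim.x (== TILE) for brevity.
// Each block loads its tile plus a one-element halo on each side into
// shared memory; each element is then read up to three times from
// shared memory instead of up to three times from global memory.
__global__ void stencil3(const float *in, float *out, int n)
{
    __shared__ float tile[TILE + 2];            // +2 for the halo
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int t = threadIdx.x + 1;                    // shifted index into the tile

    tile[t] = in[i];                            // one global load per element
    if (threadIdx.x == 0)
        tile[0] = (i > 0) ? in[i - 1] : 0.0f;   // left halo (zero at the edge)
    if (threadIdx.x == blockDim.x - 1)
        tile[t + 1] = (i + 1 < n) ? in[i + 1] : 0.0f;  // right halo
    __syncthreads();

    // Each input element is consumed by up to three threads here,
    // but was fetched from global memory only once.
    out[i] = tile[t - 1] + tile[t] + tile[t + 1];
}
```

Launched as, say, stencil3<<<n / TILE, TILE>>>(in, out, n), this cuts global read traffic by roughly a factor of three compared with each thread reading its three inputs directly from global memory.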

[This answer was assembled from comments and added as a community wiki entry to get the question off the unanswered list.]

If you issue too many requests to global memory, eventually all threads will mostly be waiting for data to arrive from it (or for writes to it to complete), so there is no independent work left to hide its latency, and the lack of bandwidth becomes the bottleneck.

Shared memory helps decrease the amount of reading from and writing to global memory, which is especially useful in cases like the one above (and such cases are typical rather than occasional).
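As a rough, hypothetical accounting (ignoring hardware caches): a 3-point stencil like the sketch above would issue 3n global loads for n outputs if every thread read its three inputs directly, but only about n loads (plus two halo elements per block) when the inputs are staged through shared memory - roughly a 3x reduction in global traffic. When the kernel is bandwidth-bound, that reduction translates almost directly into runtime.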

the overall execution time for a CUDA program is just the sum of each thread's execution time

No! It is the time between the first thread's start and the last thread's finish. Threads execute concurrently, so their individual execution times overlap rather than add up.
