
Coalesced access across blocks in CUDA?

Let us say we have 16 threads running on block 1 and another 16 threads running on block 2.

Each thread reads 1 double from memory: the 16 threads on block 1 need to read 16 doubles from memory addresses 0-127, and 16 threads on block 2 need to read from addresses 128-255.

I know that the memory reads for the 16 threads on block 1 can be done in one memory transaction because of coalesced accesses.

My question is, when we consider these two blocks, how many memory transactions do we need, one or two? In other words, can memory accesses by different blocks happen at the same time?
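The setup above boils down to simple index arithmetic. As a quick sketch (Python, not CUDA code), here is the byte address each thread touches under the usual global-index mapping `idx = blockIdx * blockDim + threadIdx`:

```python
# Sketch of the question's access pattern: 16 threads per block,
# each reading one 8-byte double from a contiguous array.
BLOCK_DIM = 16          # threads per block, as in the question
SIZEOF_DOUBLE = 8       # bytes

def thread_address(block_idx, thread_idx):
    """Byte address of the double read by a given thread."""
    return (block_idx * BLOCK_DIM + thread_idx) * SIZEOF_DOUBLE

# Block 0 covers bytes 0..127, block 1 covers bytes 128..255.
block0 = [thread_address(0, t) for t in range(BLOCK_DIM)]
block1 = [thread_address(1, t) for t in range(BLOCK_DIM)]
print(block0[0], block0[-1] + SIZEOF_DOUBLE - 1)   # 0 127
print(block1[0], block1[-1] + SIZEOF_DOUBLE - 1)   # 128 255
```

Each block's 16 accesses are contiguous and aligned, which is why the accesses within one block coalesce; the question is whether the two blocks' ranges can be served together.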

Blocks are entirely independent - the hardware may choose (and likely will) to launch them on different multiprocessors.

Threads from different blocks will run in different warps, so it is impossible to coalesce memory accesses between them.

You need at least two memory transactions: the threads of the two blocks are guaranteed to be handled in different warps.

Furthermore, even if the threads had formed one warp, or occupied the same multiprocessor and shared the L1 cache, the addresses requested by a warp are converted into 128B or 32B lines (depending on caching/non-caching mode). So in caching mode you would need at least 2 transactions, and in non-caching mode 8 transactions. Look at this very useful presentation for a better understanding of global memory access.
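The transaction counts above can be checked by counting how many aligned segments a warp's addresses fall into. A minimal Python sketch (not CUDA code), assuming aligned 8-byte accesses and the 128B/32B segment sizes named in the answer:

```python
def transactions(addresses, segment_bytes):
    """Count the aligned segments of `segment_bytes` touched by a set of
    byte addresses, where each access reads 8 bytes (one double)."""
    segments = set()
    for a in addresses:
        segments.add(a // segment_bytes)        # segment holding the first byte
        segments.add((a + 7) // segment_bytes)  # segment holding the last byte
    return len(segments)

# One full warp: 32 threads reading consecutive doubles (bytes 0..255).
warp = [i * 8 for i in range(32)]
print(transactions(warp, 128))  # caching mode: 2 x 128B lines
print(transactions(warp, 32))   # non-caching mode: 8 x 32B segments
```

For the question's 16-thread half-range (bytes 0..127), the same count gives a single 128B transaction in caching mode, matching the coalescing claim in the question.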
