
Coalesced access across blocks in CUDA?

Let us say we have 16 threads running in block 1 and another 16 threads running in block 2.

Each thread reads one double from memory: the 16 threads in block 1 read from byte addresses 0-127, and the 16 threads in block 2 read from byte addresses 128-255.

I know that the memory reads for the 16 threads in block 1 can be done in one memory transaction because the accesses are coalesced.

My question is: when we consider these two blocks together, how many memory transactions do we need, one or two? In other words, can memory accesses by different blocks happen at the same time?
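For concreteness, a minimal kernel matching this scenario might look as follows (the kernel name, launch configuration, and buffer names are illustrative, not taken from the original question):

```cuda
// Sketch of the access pattern described above: each thread loads one
// double, so thread i of block b reads element b*16 + i. With 8-byte
// doubles, block 0 touches byte addresses 0-127 and block 1 touches
// 128-255.
__global__ void read_doubles(const double *in, double *out)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    out[idx] = in[idx];
}

// Hypothetical launch: two blocks of 16 threads each.
// read_doubles<<<2, 16>>>(d_in, d_out);
```

Within each block the 16 loads are contiguous and aligned, which is what makes them coalescible in the first place.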

Blocks are entirely independent: the hardware may choose (and likely will) to launch them on different multiprocessors.

Threads from different blocks run in different warps, so it is impossible to coalesce memory accesses between them.

You need at least two memory transactions: the threads of each block are guaranteed to be handled in different warps.

Furthermore, even if the threads had formed one warp, or had occupied the same multiprocessor and shared the L1 cache, the addresses requested by a warp are converted into lines of 128 B or 32 B (depending on caching/non-caching mode). A full warp of 32 threads reading one double each covers 256 contiguous bytes, so in caching mode you would still need at least 2 transactions, and in non-caching mode 8 transactions. Look at this very useful presentation for a better understanding of global memory access.

