
Basic CUDA shared memory

I am new to CUDA and have a few questions regarding shared memory:

  1. Does every SM have the same amount of shared memory within the same GPU?

  2. How does an SM partition the shared memory among the blocks? Is it distributed equally (e.g. if there are 2 blocks, does each block get half the shared memory within the SM regardless of how much is actually used), or is it based on need?

  3. My understanding of a shared memory bank is: shared memory is divided into 32 equally sized memory banks. Does this apply per block (i.e. every block has its own 32 banks), or per SM?

  4. If I perform a cudaMemcpy from/into shared memory of more than one word, does this count as a single transaction or multiple transactions? And could this cause bank conflicts?

Thanks!

Let me begin by pointing out that shared memory is, first and foremost, an abstraction of the programming model through which a certain feature of the hardware (fast, on-chip memory) is exposed. In the CUDA programming model, every block in a grid (kernel launch) gets the same amount of shared memory. How much that is depends on the amount of statically allocated shared memory required by the kernel function, as well as any additional dynamic shared memory specified in the kernel launch.
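As a minimal sketch of those two allocation kinds (the kernel and buffer names here are hypothetical, not from the question): a fixed-size `__shared__` array is allocated statically per block, while an `extern __shared__` array is sized by the third launch-configuration parameter, in bytes per block.

```cuda
#include <cstdio>

__global__ void smemExample(int n) {
    __shared__ float staticBuf[256];      // statically allocated, same for every block
    extern __shared__ float dynamicBuf[]; // sized at launch time, same for every block

    int t = threadIdx.x;
    if (t < 256) staticBuf[t]  = (float)t;
    if (t < n)   dynamicBuf[t] = 2.0f * t;
    __syncthreads();                      // make both buffers visible block-wide
    if (t == 0) printf("block %d initialized its shared buffers\n", blockIdx.x);
}

int main() {
    int n = 128;
    // Third launch parameter: dynamic shared memory in bytes, per block.
    smemExample<<<4, 256, n * sizeof(float)>>>(n);
    cudaDeviceSynchronize();
    return 0;
}
```

Every one of the 4 blocks in this launch gets the same budget: 256 floats of static shared memory plus `n * sizeof(float)` bytes of dynamic shared memory.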

  1. Does every SM have the same amount of shared memory within the same GPU?

Yes, that is currently the case. However, this is not as relevant to the way you program CUDA as you might think, because:

  2. How does an SM partition the shared memory among the blocks? Is it distributed equally (e.g. if there are 2 blocks, does each block get half the shared memory within the SM regardless of how much is actually used), or is it based on need?

When you launch a kernel, you specify how much shared memory each block needs. This in turn determines how many blocks can fit on each multiprocessor. So it's not that the number of blocks defines how much shared memory each block gets, but the other way around: the amount of shared memory needed per block is one of the factors that determine how many blocks can reside on each multiprocessor.
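You can observe this relationship directly with the runtime's occupancy API. The sketch below (kernel name is hypothetical) queries how many blocks of a given size can be resident on one SM as the per-block dynamic shared memory request grows; the count should drop as the request increases:

```cuda
#include <cstdio>

__global__ void occupancyProbe(float *out) {
    extern __shared__ float buf[];
    buf[threadIdx.x] = (float)threadIdx.x;
    __syncthreads();
    out[blockIdx.x * blockDim.x + threadIdx.x] = buf[threadIdx.x];
}

int main() {
    // More dynamic shared memory per block -> fewer resident blocks per SM.
    for (size_t smemBytes = 4096; smemBytes <= 32768; smemBytes *= 2) {
        int numBlocks = 0;
        cudaOccupancyMaxActiveBlocksPerMultiprocessor(
            &numBlocks, occupancyProbe, /*blockSize=*/256, smemBytes);
        printf("%6zu bytes shared memory per block -> %d resident blocks per SM\n",
               smemBytes, numBlocks);
    }
    return 0;
}
```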

You will want to read up on latency hiding and occupancy, as those are quite fundamental topics when it comes to GPU programming. For more details on the memory subsystems of different GPU architectures, have a look at the CUDA Programming Guide.

  3. My understanding of a shared memory bank is: shared memory is divided into 32 equally sized memory banks. Does this apply per block (i.e. every block has its own 32 banks), or per SM?

In the end, due to the SIMD (SIMT) nature of GPU cores, the actual program execution happens in warps. When such a warp (currently, that effectively means a group of 32 threads) performs a shared memory access, bank conflicts become an issue as the shared memory request generated by that instruction is served. It is not really documented whether shared memory requests from multiple warps can be served in parallel. My guess would be that there is only one unit to handle shared memory requests per SM and, thus, the answer is no.
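To make the banking concrete: successive 4-byte words map to successive banks, so a warp reading down a column of a 32-wide shared tile hits the same bank 32 times. The classic fix, sketched here on a tile transpose (a standard illustrative example, not from the question), is to pad each row by one element so the column stride no longer maps every access to one bank:

```cuda
#define TILE 32

__global__ void transposeNoConflict(const float *in, float *out, int n) {
    // +1 padding: row stride of 33 words means elements of one column
    // fall into 32 different banks instead of all into the same one.
    __shared__ float tile[TILE][TILE + 1];

    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;
    if (x < n && y < n)
        tile[threadIdx.y][threadIdx.x] = in[y * n + x]; // row-wise write, coalesced

    __syncthreads();

    int tx = blockIdx.y * TILE + threadIdx.x;
    int ty = blockIdx.x * TILE + threadIdx.y;
    if (tx < n && ty < n)
        out[ty * n + tx] = tile[threadIdx.x][threadIdx.y]; // column read, conflict-free thanks to padding
}
```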

  4. If I perform a cudaMemcpy from/into shared memory of more than one word, does this count as a single transaction or multiple transactions? And could this cause bank conflicts?

You cannot cudaMemcpy() into shared memory. Shared memory is only accessible to device threads of the same block, and it only persists for as long as that block is running.
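Since cudaMemcpy() cannot target shared memory, the usual pattern is a cooperative copy inside the kernel itself: each thread of the block stages one element from global memory into shared memory, the block synchronizes, and then the staged data is used. A minimal sketch (names hypothetical):

```cuda
__global__ void stageToShared(const float *g_in, float *g_out, int n) {
    extern __shared__ float s[]; // dynamic shared memory, sized at launch

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) s[threadIdx.x] = g_in[i]; // global -> shared, one word per thread
    __syncthreads();                     // staged data now visible block-wide

    if (i < n) g_out[i] = 2.0f * s[threadIdx.x]; // use the staged data
}

// Launch sketch: stageToShared<<<blocks, 256, 256 * sizeof(float)>>>(d_in, d_out, n);
```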


 