
How many blocks can be allocated if I use shared memory?

I am new to CUDA programming. I have access to a Tesla K10 device. I am working on a complex problem which needs about 20 KB of memory per instance of the problem. Since CUDA provides parallelism, I have decided to use 96 threads per block (keeping warps in mind) to solve one instance of the problem. The issue is that I have a very, very large number of such problems to solve (say, more than 1,600,000). I am aware that such a large memory requirement will not fit even in global memory (which in my case is 3.5 GB, as shown in the DeviceQuery output below), so I have to solve a few problems at a time.

Also, I have mapped each problem to one block, so that each block solves one instance of the problem.

At present I am able to solve a large number of problems with the data in global memory. But since shared memory is faster than global memory, I am planning to use 20 KB of shared memory per problem.

1) My confusion is that this will permit only 2 problems to be loaded into shared memory at a time (i.e., 40 KB < 48 KB of shared memory). (This is based on my understanding of CUDA; please correct me if I am wrong.)

2) If I declare an array of 20 KB in the kernel, does it mean that (20 KB * number_of_blocks) of shared memory will be used? By number_of_blocks I mean the number of problems to be solved. My launch configuration is problem<<<...>>>(...)

All your help in this regard will be highly appreciated. Thank you in advance.

***My partial Device Query***

Device : "Tesla K10.G1.8GB"
  CUDA Driver Version / Runtime Version          6.5 / 5.0
  CUDA Capability Major/Minor version number:    3.0
  Total amount of global memory:                 3584 MBytes (3757637632 bytes)
  ( 8) Multiprocessors, (192) CUDA Cores/MP:     1536 CUDA Cores
  GPU Clock rate:                                745 MHz (0.75 GHz)
  Memory Clock rate:                             524 MHz
  Memory Bus Width:                              2048-bit
  L2 Cache Size:                                 4204060 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(65536), 2D=(65536, 65536), 3D=(4096, 4096, 4096)
  Maximum Layered 1D Texture Size, (num) layers  1D=(16384), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(16384, 2046), 65536 layers
  Total amount of constant memory:               65536 bytes
  **Total amount of shared memory per block:       49152 bytes**
  Total number of registers available per block: 65536
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  0
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
...

First, a quick summary to check I understand correctly:

  • You have ~1.5M problems to solve; these are completely independent, i.e., embarrassingly parallel
  • Each problem has a data set of ~20 KB

Tackling the whole problem set at once would require >30 GB of memory, so it's clear that you will need to split the set of problems into batches. With your 4 GB card (~3.5 GB usable with ECC on, etc.) you can fit about 150,000 problems at any one time. If you were to double-buffer these, so that the transfer of the next batch overlaps with the computation of the current batch, then you're looking at 75K problems per batch (maybe fewer if you need space for output, etc.).

The first important thing to consider is whether you can parallelise each problem, i.e., is there a way to assign multiple threads to a single problem? If so, then you should look at assigning a block of threads to solve an individual problem. Using shared memory may be worth considering, although you would be limiting your occupancy to two blocks per SM, which may hurt performance.

If you cannot parallelise within a problem, then you should not be considering shared memory since, as you say, you would be limiting yourself to two threads per SM (fundamentally eliminating the benefit of GPU computing). Instead you would need to ensure that the data layout in global memory is such that you can achieve coalesced accesses; this most likely means using an SoA (struct of arrays) layout instead of AoS (array of structs).

Your second question is a little confusing; it's not clear whether you mean "block" in the GPU context or in the problem context. Fundamentally, though, if you declare a __shared__ array of 20 KB in your kernel code, then that array will be allocated once per block, and each block will have the same base address.
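A minimal kernel sketch of this per-block allocation, under the question's assumptions (96 threads per block, one block per problem; the kernel body and data layout are hypothetical):

```cuda
// Each block solves one problem instance; blockIdx.x selects the instance.
__global__ void problem(const float *global_data, float *results, int num_problems)
{
    // One 20 KB buffer *per block*, not per thread and not per grid: every
    // resident block gets its own instance of this array, carved out of the
    // 48 KB of shared memory on the SM it runs on.
    __shared__ float instance[5 * 1024];   // 5120 floats = 20 KB

    int problem_id = blockIdx.x;
    if (problem_id >= num_problems) return;

    // Cooperative load: the block's 96 threads stage this problem's data
    // from global memory into shared memory.
    for (int i = threadIdx.x; i < 5 * 1024; i += blockDim.x)
        instance[i] = global_data[problem_id * (5 * 1024) + i];
    __syncthreads();

    // ... work on `instance` in shared memory, then write results out ...
}

// Launch: one block per problem in the batch, 96 threads per block.
// problem<<<problems_in_batch, 96>>>(d_data, d_results, problems_in_batch);
```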

Update following OP's comments

The GPU contains a number of SMs, and each SM has a small physical memory which is used for both the L1 cache and shared memory. In your case, the K10, each SM has 48 KB available for use as shared memory, meaning that all the blocks executing on an SM at any one time can use up to 48 KB between them. Since you need 20 KB per block, you can have a maximum of two blocks executing on each SM at any time. This doesn't affect how many blocks you can set in your launch configuration; it merely affects how they are scheduled. This answer talks in a bit more detail (albeit for a device with 16 KB per SM) and this (very old) answer explains a little more, although probably the most helpful (and up-to-date) info is on the CUDA education pages.
