

Can my kernel code tell how much shared memory it has available?

Is it possible for running device-side CUDA code to know how much (static and/or dynamic) shared memory is allocated to each block of the running kernel's grid?

On the host side, you know how much shared memory a launched kernel had (or will have), since you set that value yourself; but what about the device side? It's easy to compile in the upper limit to that size, but that information is not available (unless passed explicitly) to the device. Is there an on-GPU mechanism for obtaining it? The CUDA C Programming Guide doesn't seem to discuss this issue (in or outside of the section on shared memory).

TL;DR: Yes. Use the function below.

It is possible: that information is available to kernel code in the special registers %dynamic_smem_size and %total_smem_size.

Typically, when we write kernel code, we don't need to be aware of specific registers (special or otherwise) - we write C/C++ code. Even when we do use these registers, the CUDA compiler hides this from us through functions or structures which hold their values. For example, when we use the value threadIdx.x, we are actually accessing the special register %tid.x, which is set differently for every thread in the block. You can see these registers "in action" when you look at compiled PTX code. ArrayFire has written a nice blog post with some worked examples: Demystifying PTX code.
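If you want to look at the PTX for your own kernels, one way (assuming a standard CUDA toolkit installation; `kernel.cu` and `my_app` are placeholder names) is:

```shell
# Stop compilation after PTX generation; the output is human-readable assembly.
nvcc -ptx kernel.cu -o kernel.ptx

# Alternatively, extract the PTX embedded in an already-built binary:
cuobjdump -ptx my_app
```

Searching the resulting `.ptx` for `%tid` will show the register reads behind every use of `threadIdx`.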

But if the CUDA compiler "hides" register use from us, how can we go behind that curtain and actually insist on using them, accessing them with those %-prefixed names? Well, here's how:

__forceinline__ __device__ unsigned dynamic_smem_size()
{
    unsigned ret;
    // Note the doubled '%': in inline PTX, a literal '%' must be escaped as '%%'
    asm volatile ("mov.u32 %0, %%dynamic_smem_size;" : "=r"(ret));
    return ret;
}

and a similar function for %total_smem_size. This function makes the compiler add an explicit PTX instruction, just like asm can be used in host code to emit CPU assembly instructions directly. The function should always be inlined, so when you assign

x = dynamic_smem_size();

you actually just assign the value of the special register to x.
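For completeness, here is a sketch of the companion function for %total_smem_size, together with a toy kernel showing both in use. The kernel, its names, and the launch configuration are purely illustrative additions, not part of the original answer; printf on the device requires compute capability 2.0 or higher.

```cuda
#include <cstdio>

__forceinline__ __device__ unsigned total_smem_size()
{
    unsigned ret;
    // %total_smem_size holds the per-block total (static + dynamic) in bytes
    asm volatile ("mov.u32 %0, %%total_smem_size;" : "=r"(ret));
    return ret;
}

// Illustrative kernel: thread 0 of block 0 reports both sizes.
__global__ void report_smem()
{
    __shared__ int static_buf[256];  // 1024 bytes of static shared memory
    (void) static_buf;               // silence the unused-variable warning
    if (threadIdx.x == 0 && blockIdx.x == 0) {
        printf("dynamic: %u bytes, total: %u bytes\n",
               dynamic_smem_size(), total_smem_size());
    }
}
```

Launched as, say, `report_smem<<<1, 32, 2048>>>();`, the kernel should report 2048 bytes of dynamic shared memory, and a total covering both allocations (possibly a bit more than 1024 + 2048, since the reported total may include alignment padding).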
