Templated CUDA kernel with dynamic shared memory

I want to call different instantiations of a templated CUDA kernel with dynamically allocated shared memory in one program. My first naive approach was to write:

template<typename T>
__global__ void kernel(T* ptr)
{
  extern __shared__ T smem[];
  // calculations here ...
}

template<typename T>
void call_kernel( T* ptr, const int n )
{
  dim3 dimBlock(n), dimGrid;
  kernel<<<dimGrid, dimBlock, n*sizeof(T)>>>(ptr);
}

int main(int argc, char *argv[])
{
  const int n = 32;
  float *float_ptr;
  double *double_ptr;
  cudaMalloc( (void**)&float_ptr, n*sizeof(float) );
  cudaMalloc( (void**)&double_ptr, n*sizeof(double) );

  call_kernel( float_ptr, n );
  call_kernel( double_ptr, n ); // problem, 2nd instantiation

  cudaFree( (void*)float_ptr );
  cudaFree( (void*)double_ptr );
  return 0;
}

However, this code cannot be compiled. nvcc gives me the following error message:

main.cu(4): error: declaration is incompatible with previous "smem"
(4): here
          detected during:
            instantiation of "void kernel(T *) [with T=double]"
(12): here
            instantiation of "void call_kernel(T *, int) [with T=double]"
(24): here

I understand that I am running into a name conflict because the shared memory is declared as extern. Nevertheless, as far as I know, there is no way around that if I want to define its size at runtime.

So, my question is: is there any elegant way to obtain the desired behavior? By elegant I mean without code duplication, etc.

Dynamically allocated shared memory is really just a size (in bytes) and a pointer being set up for the kernel. So something like this should work:

replace this:

extern __shared__ T smem[];

with this:

extern __shared__ __align__(sizeof(T)) unsigned char my_smem[];
T *smem = reinterpret_cast<T *>(my_smem);
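Put together, the fix applied to the kernel from the question might look like this (a minimal sketch; the copy-and-double body is just placeholder work of my own, and error checking is omitted):

```cuda
template <typename T>
__global__ void kernel(T* ptr)
{
  // One extern __shared__ declaration of a fixed (unsigned char) type
  // keeps all instantiations compatible; re-cast it to T for actual use.
  extern __shared__ __align__(sizeof(T)) unsigned char my_smem[];
  T* smem = reinterpret_cast<T*>(my_smem);

  smem[threadIdx.x] = ptr[threadIdx.x];   // placeholder work
  __syncthreads();
  ptr[threadIdx.x] = smem[threadIdx.x] + smem[threadIdx.x];
}

template <typename T>
void call_kernel(T* ptr, const int n)
{
  // n*sizeof(T) bytes of dynamic shared memory, as in the question
  kernel<<<1, n, n * sizeof(T)>>>(ptr);
}
```

With this change, both the `float` and the `double` instantiations compile in the same translation unit.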

You can see other examples of re-casting dynamically allocated shared memory pointers in the programming guide, which can serve other needs.
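One such pattern is carving several typed sub-arrays out of a single dynamic allocation. A hedged sketch (the kernel name and the sizes `nd`/`nf` are my own illustration):

```cuda
// Launched with sharedMemBytes = nd*sizeof(double) + nf*sizeof(float).
__global__ void two_array_kernel(int nd, int nf)
{
    extern __shared__ unsigned char pool[];

    // Place the more strictly aligned type (double, 8 bytes) first so
    // the float sub-array that follows starts at a suitable offset.
    double* d = reinterpret_cast<double*>(pool);
    float*  f = reinterpret_cast<float*>(pool + nd * sizeof(double));

    if (threadIdx.x < nd) d[threadIdx.x] = 0.0;
    if (threadIdx.x < nf) f[threadIdx.x] = 0.0f;
}
```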

EDIT: updated my answer to reflect the comment by @njuffa.

(A variation on @RobertCrovella's answer)

NVCC is not willing to accept two extern __shared__ arrays with the same name but different types, even if they're never in each other's scope. We'll need to satisfy NVCC by having all our template instances use the same type for the shared memory under the hood, while letting the kernel code that uses them see the type it likes.

So we replace this instruction:

extern __shared__ T smem[];

with this one:

auto smem = shared_memory_proxy<T>();

where:

template <typename T>
__device__ T* shared_memory_proxy()
{
    // do we need an __align__() here? I don't think so...
    extern __shared__ unsigned char memory[];
    return reinterpret_cast<T*>(memory);
}

is in some device-side code include file.
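With the helper in a common header, the kernel from the question reduces to a sketch like this (the helper is repeated here only to keep the example self-contained):

```cuda
template <typename T>
__device__ T* shared_memory_proxy()
{
    // Single fixed-type extern declaration shared by all instantiations.
    extern __shared__ unsigned char memory[];
    return reinterpret_cast<T*>(memory);
}

template <typename T>
__global__ void kernel(T* ptr)
{
    auto smem = shared_memory_proxy<T>();
    // calculations here ...
}
```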

Advantages:

  • One-liner at the site of use.
  • Simpler syntax to remember.
  • Separation of concerns - whoever reads the kernel doesn't have to think about why they're seeing extern, or alignment specifiers, or a reinterpret cast, etc.

edit: This is implemented as part of my CUDA kernel author's tools header-only library: shared_memory.cuh (where it's named shared_memory::dynamic::proxy()).
