How to define CUDA device constant like a C++ const/constexpr?

In a .cu file I've tried the following in the global scope (ie not in a function):

__device__ static const double cdInf = HUGE_VAL / 4;

And got an nvcc error:

error : dynamic initialization is not supported for __device__, __constant__ and __shared__ variables.

How to define a C++ const/constexpr on the device, if that's possible?

NOTE1: #define is out of the question not only for aesthetic reasons, but also because in practice the expression is more complex and involves an internal data type, not just double. So calling the constructor each time in each CUDA thread would be too expensive.

NOTE2: I doubt the performance of __constant__ because it's not a compile-time constant, but rather behaves like a variable written with cudaMemcpyToSymbol.

Use a constexpr __device__ function:

#include <math.h>   // for HUGE_VAL
#include <stdio.h>
__device__ constexpr double cdInf() { return HUGE_VAL / 4; }
__global__ void print_cdinf() { printf("in kernel, cdInf() is %lf\n", cdInf()); }
int main() { print_cdinf<<<1, 1>>>(); cudaDeviceSynchronize(); return 0; }  // synchronize so the kernel's printf output is flushed

The PTX should be something like:

.visible .entry print_cdinf()(

)
{
        .reg .b64       %SP;
        .reg .b64       %SPL;
        .reg .b32       %r<2>;
        .reg .b64       %rd<7>;


        mov.u64         %rd6, __local_depot0;
        cvta.local.u64  %SP, %rd6;
        add.u64         %rd1, %SP, 0;
        cvta.to.local.u64       %rd2, %rd1;
        mov.u64         %rd3, 9218868437227405312;
        st.local.u64    [%rd2], %rd3;
        mov.u64         %rd4, $str;
        cvta.global.u64         %rd5, %rd4;
        // Callseq Start 0
        {
        .reg .b32 temp_param_reg;
        // <end>}
        .param .b64 param0;
        st.param.b64    [param0+0], %rd5;
        .param .b64 param1;
        st.param.b64    [param1+0], %rd1;
        .param .b32 retval0;
        call.uni (retval0), 
        vprintf, 
        (
        param0, 
        param1
        );
        ld.param.b32    %r1, [retval0+0];

        //{
        }// Callseq End 0
        ret;
}

With no code emitted for the constexpr function. You could also use a constexpr __host__ function, but that's experimental in CUDA 7: the relevant nvcc command-line option seems to be --expt-relaxed-constexpr; see here for more details (thanks @harrism). A sketch of that variant follows.
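A minimal sketch of that host-side constexpr variant, assuming a toolkit that accepts the experimental flag mentioned above (the file name and compile line are illustrative):

// compile with: nvcc --expt-relaxed-constexpr constexpr_host.cu
#include <math.h>
#include <stdio.h>

// An ordinary host constexpr function; with --expt-relaxed-constexpr
// nvcc also allows calling it from device code.
constexpr double cdInf() { return HUGE_VAL / 4; }

__global__ void print_cdinf() { printf("in kernel, cdInf() is %lf\n", cdInf()); }

int main() {
    print_cdinf<<<1, 1>>>();
    cudaDeviceSynchronize();   // make sure the kernel's printf output is flushed
    return 0;
}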

To make the code you have shown compile and work as expected, you need to initialize the variable at runtime, not at compile time. To do this, add a host-side call to cudaMemcpyToSymbol, something like:

__device__ double cdInf;   // initialized from the host at runtime

// ... host code, before launching any kernel that reads cdInf:

double val = HUGE_VAL / 4;
cudaMemcpyToSymbol(cdInf, &val, sizeof(double));

However, for a single value, passing it as a kernel argument would seem far more sensible, as in the sketch below. The compiler will automagically store the argument in constant memory on all supported architectures, and there is a "free" constant cache broadcast mechanism which should make the cost of accessing the value at runtime negligible.
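A minimal sketch of the kernel-argument approach (the kernel name and launch configuration are illustrative, not from the question):

#include <math.h>
#include <stdio.h>

__global__ void use_value(double cdInf) {
    // Kernel arguments are placed in constant memory by the compiler,
    // so all threads read cdInf through the constant cache broadcast.
    printf("cdInf = %lf\n", cdInf);
}

int main() {
    const double cdInf = HUGE_VAL / 4;   // computed once on the host
    use_value<<<1, 1>>>(cdInf);
    cudaDeviceSynchronize();
    return 0;
}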

To initialize it you have to use cudaMemcpyToSymbol. It is not a compile-time constant, but it is stored in the constant memory of the device and has some advantages over global memory. From the CUDA blog:

For all threads of a half warp, reading from the constant cache is as fast as reading from a register as long as all threads read the same address. Accesses to different addresses by threads within a half warp are serialized, so cost scales linearly with the number of different addresses read by all threads within a half warp.

You do not need to use const, and you cannot use it. It is not a C++ constant, since you need to modify it through cudaMemcpyToSymbol. So it is not a "real" constant, at least from the C++ point of view. But it behaves like a constant inside the device kernels, because you can modify it only through cudaMemcpyToSymbol, which is callable only from the host. A sketch of this __constant__ approach follows.
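A minimal sketch of the __constant__ approach described in this answer (variable and kernel names are illustrative):

#include <math.h>
#include <stdio.h>

__constant__ double cdInf;   // lives in device constant memory

__global__ void print_cdinf() { printf("cdInf = %lf\n", cdInf); }

int main() {
    double val = HUGE_VAL / 4;
    // Copy the value into the __constant__ symbol before any kernel reads it.
    cudaMemcpyToSymbol(cdInf, &val, sizeof(double));
    print_cdinf<<<1, 1>>>();
    cudaDeviceSynchronize();
    return 0;
}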
