CUDA device function pointers in structure without static pointers or symbol copies

My intended program flow would look like the following if it were possible:

typedef struct structure_t
{
  /* device function pointer. */
  __device__ float (*function_pointer)(float, float, float[]);
} structure;


/* function to be assigned. */
__device__ float
my_function (float a, float b, float c[])
{
  /* do some stuff on the device. */
}

void
some_structure_initialization_function (structure *st)
{
  /* assign. */
  st->function_pointer = my_function;
}

This is not possible, and ends in a familiar error during compilation regarding the placement of __device__ in the structure.

 error: attribute "device" does not apply here

There are some examples of similar types of problems here on Stack Overflow, but they all involve the use of static pointers outside the structure. Examples are "device function pointers as struct members" and "device function pointers". I've taken a similar approach with success previously in other codes where it's easy for me to use static device pointers and define them outside of any structures. Currently, though, this is a problem. It's written as an API of sorts, and the user may define one or two or dozens of structures which need to include a device function pointer. So, defining static device pointers outside of the structure is a major problem.

I'm fairly certain the answer exists within the posts I have linked above, through the use of symbol copies, but I've not been able to put them to successful use.

What you are trying to do is possible, but you have made a few mistakes in the way you are declaring and defining the structures that will hold and use the function pointer.

Quoting the question: "This is not possible, and ends in a familiar error during compilation regarding the placement of __device__ in the structure."

  error: attribute "device" does not apply here 

This is only because you are attempting to assign a memory space to a structure or class data member, which is illegal in CUDA. The memory space of all class or structure data members is implicitly set when you define or instantiate a class. So something only slightly different (and more concrete):

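#include <cstdio>   // printf, used by the host code in main

// Function pointer type shared by host and device code.
typedef float (* fp)(float, float, float4);

struct functor
{
    float c0, c1;
    fp f;   // no memory-space qualifier on the member

    __device__ __host__
    functor(float _c0, float _c1, fp _f) : c0(_c0), c1(_c1), f(_f) {};

    __device__ __host__
    float operator()(float4 x) { return f(c0, c1, x); };
};

__global__
void kernel(float c0, float c1, fp f, const float4 * x, float * y, int N)
{
    int tid = threadIdx.x + blockIdx.x * blockDim.x;

    // Instantiated in device code, so f is treated as a device function pointer.
    struct functor op(c0, c1, f);
    for(int i = tid; i < N; i += blockDim.x * gridDim.x) {
        y[i] = op(x[i]);
    }
}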

is perfectly valid. The function pointer member f (of type fp) in functor is implicitly a __device__ function pointer when an instance of functor is instantiated in device code. If it were instantiated in host code, the function pointer would implicitly be a host function pointer. In the kernel, a device function pointer passed as an argument is used to instantiate a functor instance. All perfectly legal.

I believe I am correct in saying that there is no direct way to get the address of a __device__ function in host code, so you still require some static declarations and symbol manipulation. This might be different in CUDA 5, but I have not tested it to see. If we flesh out the device code above with a couple of __device__ functions and some supporting host code:

__device__ __host__
float f1 (float a, float b, float4 c)
{
    return a + (b * c.x) +  (b * c.y) + (b * c.z) + (b * c.w);
}

__device__ __host__
float f2 (float a, float b, float4 c)
{
    return a + b + c.x + c.y + c.z + c.w;
}

// Static table of device function addresses, read back by the host in main.
__constant__ fp function_table[] = {f1, f2};

int main(void)
{
    const float c1 = 1.0f, c2 = 2.0f;
    const int n = 20;
    float4 vin[n];
    float vout1[n], vout2[n];
    for(int i=0, j=0; i<n; i++) {
        vin[i].x = j++; vin[i].y = j++;
        vin[i].z = j++; vin[i].w = j++;
    }

    float4 * _vin;
    float * _vout1, * _vout2;
    size_t sz4 = sizeof(float4) * size_t(n);
    size_t sz1 = sizeof(float) * size_t(n);
    cudaMalloc((void **)&_vin, sz4);
    cudaMalloc((void **)&_vout1, sz1);
    cudaMalloc((void **)&_vout2, sz1);
    cudaMemcpy(_vin, &vin[0], sz4, cudaMemcpyHostToDevice);

    // Copy the device function addresses to the host. The symbol is passed
    // directly; string symbol names were removed in CUDA 5.0.
    fp funcs[2];
    cudaMemcpyFromSymbol(funcs, function_table, 2 * sizeof(fp));

    kernel<<<1,32>>>(c1, c2, funcs[0], _vin, _vout1, n);
    cudaMemcpy(&vout1[0], _vout1, sz1, cudaMemcpyDeviceToHost);

    kernel<<<1,32>>>(c1, c2, funcs[1], _vin, _vout2, n);
    cudaMemcpy(&vout2[0], _vout2, sz1, cudaMemcpyDeviceToHost);

    // Same functor instantiated on the host, for comparison.
    struct functor func1(c1, c2, f1), func2(c1, c2, f2);
    for(int i=0; i<n; i++) {
        printf("%2d %6.f %6.f (%6.f,%6.f,%6.f,%6.f ) %6.f %6.f %6.f %6.f\n",
                i, c1, c2, vin[i].x, vin[i].y, vin[i].z, vin[i].w,
                vout1[i], func1(vin[i]), vout2[i], func2(vin[i]));
    }

    return 0;
}

you get a fully compilable and runnable example. Here two __device__ functions and a static function table provide a mechanism for the host code to retrieve __device__ function pointers at runtime. The kernel is called once with each __device__ function and the results displayed, along with the exact same functor and functions instantiated and called from host code (and thus running on the host) for comparison:

$ nvcc -arch=sm_30 -Xptxas="-v" -o function_pointer function_pointer.cu 

ptxas info    : Compiling entry function '_Z6kernelffPFfff6float4EPKS_Pfi' for 'sm_30'
ptxas info    : Function properties for _Z6kernelffPFfff6float4EPKS_Pfi
    16 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Function properties for _Z2f1ff6float4
    24 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Function properties for _Z2f2ff6float4
    24 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 16 registers, 356 bytes cmem[0], 16 bytes cmem[3]

$ ./function_pointer 
 0      1      2 (     0,     1,     2,     3 )     13     13      9      9
 1      1      2 (     4,     5,     6,     7 )     45     45     25     25
 2      1      2 (     8,     9,    10,    11 )     77     77     41     41
 3      1      2 (    12,    13,    14,    15 )    109    109     57     57
 4      1      2 (    16,    17,    18,    19 )    141    141     73     73
 5      1      2 (    20,    21,    22,    23 )    173    173     89     89
 6      1      2 (    24,    25,    26,    27 )    205    205    105    105
 7      1      2 (    28,    29,    30,    31 )    237    237    121    121
 8      1      2 (    32,    33,    34,    35 )    269    269    137    137
 9      1      2 (    36,    37,    38,    39 )    301    301    153    153
10      1      2 (    40,    41,    42,    43 )    333    333    169    169
11      1      2 (    44,    45,    46,    47 )    365    365    185    185
12      1      2 (    48,    49,    50,    51 )    397    397    201    201
13      1      2 (    52,    53,    54,    55 )    429    429    217    217
14      1      2 (    56,    57,    58,    59 )    461    461    233    233
15      1      2 (    60,    61,    62,    63 )    493    493    249    249
16      1      2 (    64,    65,    66,    67 )    525    525    265    265
17      1      2 (    68,    69,    70,    71 )    557    557    281    281
18      1      2 (    72,    73,    74,    75 )    589    589    297    297
19      1      2 (    76,    77,    78,    79 )    621    621    313    313

If I have understood your question correctly, the above example should give you pretty much all the design patterns you need to implement your ideas in device code.
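To connect this back to the structure in the question, here is a minimal sketch of how the same pattern might look when the pointer lives inside a user-defined structure. It assumes a simplified signature (float4 instead of float[]), and the names my_function_ptr, structure_init, apply and evaluate are hypothetical helpers invented for illustration, not part of the original code. The static declaration is needed once per device function, not once per structure, so any number of structures can share it.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

typedef float (*fp)(float, float, float4);

// Plain structure: the pointer member carries no memory-space qualifier.
struct structure {
    float c0, c1;
    fp function_pointer;
};

// Function to be assigned (assumed body, for illustration only).
__device__ float my_function(float a, float b, float4 c)
{
    return a + b * (c.x + c.y + c.z + c.w);
}

// One static device variable per function, holding its device address.
__device__ fp my_function_ptr = my_function;

// The structure is passed by value; the call goes through its pointer member.
__global__ void apply(structure st, float4 x, float *y)
{
    *y = st.function_pointer(st.c0, st.c1, x);
}

// Hypothetical initializer: copies the device address into the host-side struct.
static cudaError_t structure_init(structure *st, float c0, float c1)
{
    st->c0 = c0;
    st->c1 = c1;
    return cudaMemcpyFromSymbol(&st->function_pointer, my_function_ptr, sizeof(fp));
}

// Hypothetical helper: runs the structure's function on one input on the device.
static float evaluate(const structure &st, float4 v)
{
    float *d_y = nullptr;
    float h_y = 0.0f;
    cudaMalloc((void **)&d_y, sizeof(float));
    apply<<<1, 1>>>(st, v, d_y);
    cudaMemcpy(&h_y, d_y, sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_y);
    return h_y;
}

int main(void)
{
    // Two structures with different coefficients share one static pointer.
    structure a, b;
    if (structure_init(&a, 1.0f, 2.0f) != cudaSuccess) return 1;
    if (structure_init(&b, 0.0f, 1.0f) != cudaSuccess) return 1;

    float4 v = make_float4(0.0f, 1.0f, 2.0f, 3.0f);
    printf("a: %g  b: %g\n", evaluate(a, v), evaluate(b, v));
    return 0;
}
```

The function pointer stored in the host-side structure is only meaningful on the device, so it should not be called from host code. If the API has several candidate functions, a table like function_table above would serve the same purpose as the single my_function_ptr variable here.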

Notice: technical posts on this site are licensed under CC BY-SA 4.0. If you repost this content, please credit this site's URL or the original address.