
Assignment of device function pointers in CUDA (from host function pointers)

I encapsulate my function pointers in a structure/class. I can use these functions in a CPU implementation easily. However, if I want to use the function pointers in CUDA, I have to register these functions via CUDA directives. Unfortunately, this is where things become tricky. What I want is to create and assign device function pointers from a class containing host function pointers.

But let's start with the structure:

#ifndef TRANSFERFUNCTIONS_H_
#define TRANSFERFUNCTIONS_H_

#ifndef SWIG
#include <cmath>
#include <stdio.h>
#include <string.h>
#endif

#define PI    3.14159265358979323846f 



typedef float (*pDistanceFu) (float, float);
typedef float (*pDecayFu) (float, float, float);


//////////////////////////////////////////////////////////////////////////////////////////////
#ifdef __CUDACC__
        __host__ __device__
#endif
inline static float
fcn_gaussian_nhood (float dist, float sigmaT) {
        return exp(-pow(dist, 2.f)/(2.f*pow(sigmaT, 2.f)));
}

#ifdef __CUDACC__
        __host__ __device__
#endif
inline static float
fcn_rad_decay (float sigma0, float T, float lambda) {
        return std::floor(sigma0*exp(-T/lambda) + 0.5f);
}

//////////////////////////////////////////////////////////////////////////////////////////////
#ifdef __CUDACC__
        __host__ __device__
#endif
inline static float
fcn_lrate_decay (float sigma0, float T, float lambda) {
        return sigma0*exp(-T/lambda);
}

class DistFunction;
typedef float (*pDistanceFu) (float, float);
typedef float (*pDecayFu) (float, float, float);
typedef float (DistFunction::*pmDistanceFu) (float, float);
typedef float (DistFunction::*pmDecayFu) (float, float, float);


class DistFunction {
private:
        pDistanceFu hDist;
        pDecayFu hRadDecay; 
        pDecayFu hLRateDecay;

public:
        DistFunction(char *, pDistanceFu, pDecayFu, pDecayFu);
        void Assign();

        char *name;
        pDistanceFu distance;
        pDecayFu rad_decay;
        pDecayFu lrate_decay;
};

void test();

#endif /* TRANSFERFUNCTIONS_H_ */

Implementation:

//#include <iostream>
#include "Functions.h"
#include <iostream>
#include <thrust/extrema.h>
#include <thrust/distance.h>
#include <thrust/device_vector.h>


DistFunction::DistFunction(char *cstr, pDistanceFu dist, pDecayFu rad, pDecayFu lrate) : name(cstr), distance(dist), rad_decay(rad), lrate_decay(lrate) {
}

void DistFunction::Assign() {
        pDistanceFu hDist;
        pDecayFu hRadDecay; 
        pDecayFu hLRateDecay;

        cudaMemcpyFromSymbol(&hDist, distance, sizeof(pDistanceFu) );
        cudaMemcpyFromSymbol(&hRadDecay, rad_decay, sizeof(pDecayFu) );
        cudaMemcpyFromSymbol(&hLRateDecay, lrate_decay, sizeof(pDecayFu) );

        distance = hDist;
        rad_decay = hRadDecay;
        lrate_decay = hLRateDecay;
}

DistFunction fcn_gaussian = DistFunction(
        (char*)"gaussian",
        fcn_gaussian_nhood,
        fcn_rad_decay,
        fcn_lrate_decay
);



struct sm20lrate_decay_functor {
        float fCycle;
        float fCycles;
        DistFunction m_pfunc;

        sm20lrate_decay_functor(const DistFunction &pfunc, float cycle, float cycles) : m_pfunc(pfunc), fCycle(cycle), fCycles(cycles) {}

        __host__ __device__
        float operator()(float lrate) {
                return (m_pfunc.lrate_decay)(lrate, fCycle, fCycles);
        }
};

void test() {
        unsigned int iWidth     = 4096;
        thrust::device_vector<float> dvLearningRate(iWidth, 0.f);
        thrust::device_vector<float> dvLRate(iWidth, 0.f);

        thrust::transform( dvLRate.begin(),
                dvLRate.end(),
                dvLearningRate.begin(),
                sm20lrate_decay_functor(fcn_gaussian, 1, 100) );
}

Edit: Made a minimal example.

It seems that CUDA device function pointers are useless, because I cannot use them dynamically. What they were implemented for remains enigmatic to me. Can it be that CUDA does not really support function pointers, but just uses function references in a similar way?

The question is not explicit enough. I will try to rephrase it: is it possible to get the function pointer of a device function from the host, without using an intermediate globally declared variable?

This is possible, though not exactly the way you express it.

First, in your code sample the functions are marked inline static; hence, if CUDA sees no use for their address, they will most probably get inlined, and getting a pointer to them will not be feasible.

Second, you do not document what GetDistFunction() returns, so we don't know what symbol it yields.

The method you are using, cudaMemcpyFromSymbol, is documented as requiring that

symbol is a variable that resides in global or constant memory space.

A function pointer symbol is not such a variable; it is a pointer to a code region. Also, GetDistFunction()->xxx is unlikely to be a symbol.

The technique you use is one approach to the operation you intend. You may also initialize your structure on the device, where getting the function pointer is as trivial as it is on the host side. That way, your code gets simpler, with no call to cudaMemcpyFromSymbol and no global variable holding the pointer. Here is a snippet illustrating both approaches, the second one avoiding the intermediate global-scope variable:

#include <cstdio>

typedef int (*funcptr) ();

__device__ int f() { return 42 ; }

__device__ funcptr f_ptr = f ;

__global__ void kernel ( funcptr func )
{
    int k = func () ;
    printf ("%d\n", k) ;

    funcptr func2 = f ; // does not use a global-scope variable
    printf ("%d\n", func2()) ;
}


int main ()
{
    funcptr h_funcptr ;

    if (cudaSuccess != cudaMemcpyFromSymbol (&h_funcptr, f_ptr, sizeof (funcptr)))
        printf ("FAILED to get SYMBOL\n");

    kernel <<<1,1>>> (h_funcptr) ;
    if (cudaDeviceSynchronize() != cudaSuccess)
        printf ("FAILED\n");
    else
        printf ("SUCCEEDED\n");
}

Finally, as a design comment, you may want to try virtual functions and build the appropriate instance of your class on the device; all these initialization steps are then generated by the compiler. Here is an example:

#include <cstdio>

class T
{
public:
    virtual __device__ int f() const = 0 ;
} ;

class G : public T
{
public:
    virtual __device__ int f() const { return 42; }
} ;

__global__ void kernel2 ()
{
    T* t = new G() ;
    int k = t->f();
    printf ("%d\n", k) ;
}

int main ()
{
    kernel2<<<1,1>>>();
    if (cudaDeviceSynchronize() != cudaSuccess)
        printf ("FAILED\n");
    return 0 ;
}

Using the prototype pattern or a singleton would also help here.

I finally found out that the example posted in my question is impossible to realize with device function pointers, because function pointers can neither be assigned outside of main's scope (e.g. in a constructor) nor dynamically.

Functionally, this demo implementation of CUDA function pointers corresponds to the plain example that follows it.

#include <cstdio>

typedef int (*funcptr) ();

__device__ int f() { return 42 ; }

__device__ funcptr f_ptr = f ;

__global__ void kernel ( funcptr func )
{
    int k = func () ;
    printf ("%d\n", k) ;

    funcptr func2 = f ; // does not use a global-scope variable
    printf ("%d\n", func2()) ;
}


int main ()
{
    funcptr h_funcptr ;

    if (cudaSuccess != cudaMemcpyFromSymbol (&h_funcptr, f_ptr, sizeof (funcptr)))
        printf ("FAILED to get SYMBOL\n");

    kernel <<<1,1>>> (h_funcptr) ;
    if (cudaDeviceSynchronize() != cudaSuccess)
        printf ("FAILED\n");
    else
        printf ("SUCCEEDED\n");
}

As one can clearly see, the example above gains nothing in flexibility, as every function pointer must be assigned to a device symbol at global scope.

#include <cstdio>

__device__ int f() { return 42 ; }

__global__ void kernel () {
    int k = f() ;
    printf ("%d\n", k) ;
}

int main ()
{
    kernel <<<1,1>>> () ;
    if (cudaDeviceSynchronize() != cudaSuccess)
        printf ("FAILED\n");
    else
        printf ("SUCCEEDED\n");
}

The only way to circumvent NVIDIA's device function pointer implementation, which (for the reasons named above) yields no benefit over normal function calls, is to use templates. Unfortunately, templates allow no run-time flexibility. Nevertheless, this is no disadvantage compared to CUDA device function pointers, because they do not allow run-time exchange of functions either.

This is my template-based solution for the problem illustrated above. It may look like a strong opinion against CUDA device function pointers, but if someone can prove me wrong, they can post an example.

typedef float (*pDistanceFu) (float, float);
typedef float (*pDecayFu) (float, float, float);

template <pDistanceFu Dist, pDecayFu Rad, pDecayFu LRate>
class DistFunction {    
public:
        DistFunction() {}
        DistFunction(const char *cstr) : name(cstr) {};

        const char *name;

        #ifdef __CUDACC__
                __host__ __device__
        #endif
        static float distance(float a, float b) { return Dist(a,b); };
        #ifdef __CUDACC__
                __host__ __device__
        #endif
        static float rad_decay(float a, float b, float c) { return Rad(a,b,c); };
        #ifdef __CUDACC__
                __host__ __device__
        #endif
        static float lrate_decay(float a, float b, float c) { return LRate(a,b,c); };
};

And an example:

template <class F>
struct functor {
        float fCycle;
        float fCycles;

        functor(float cycle, float cycles) : fCycle(cycle), fCycles(cycles) {}

        __host__ __device__
        float operator()(float lrate) {
                return F::lrate_decay(lrate, fCycle, fCycles);
        }
};

typedef DistFunction<fcn_gaussian_nhood,fcn_rad_decay,fcn_lrate_decay> gaussian;
void test() {
        functor<gaussian> test(0,1);
}
