CUDA: How to return a device lambda from a host function

I have a virtual function which returns a different lambda depending on the derived class:

class Base
{
public:
    virtual std::function<float()> foo(void) = 0;
};

class Derived : public Base
{
public:
    std::function<float()> foo(void) {
        return [] __device__ (void) {
            return 1.0f;
        };
    }
};

Then I want to pass this lambda to a CUDA kernel and call it from the device. In other words, I want to do this:

template<typename Func>
__global__ void kernel(Func f) {
    f();
}

int main(int argc, char** argv)
{
    Base* obj = new Derived;
    kernel<<<1, 1>>>(obj->foo());
    cudaDeviceSynchronize();
    return 0;
}

The above gives an error like this: calling a __host__ function("std::function<float ()> ::operator ()") from a __global__ function("kernel< ::std::function<float ()> > ") is not allowed

As you can see, I declare my lambda as __device__, but the foo() method stores it in a std::function in order to return it. As a result, what is passed to kernel() is a host address, and of course it does not work. I guess that is my problem, right? So my questions are:

  • Is it somehow possible to create a __device__ std::function and return that from the foo() method?

  • If this is not possible, is there any other way to dynamically select a lambda and pass it to the CUDA kernel? Hard-coding multiple calls to kernel() with all the possible lambdas is not an option.

So far, from the quick research I did, it seems that CUDA does not support the syntax required to make a function return a device lambda. I just hope I am wrong. :) Any ideas?

Thanks in advance

Before actually answering, I have to wonder whether your question isn't an XY problem. That is, I am skeptical by default that people have a good reason for executing code through lambdas/function pointers on the device.

But I won't evade your question like that...

Is it somehow possible to create a __device__ std::function and return that from the foo() method?

Short answer: No, try something else.

Longer answer: If you want to implement a large chunk of the standard library on the device side, then maybe you could have a device-side std::function-like class. But I'm not sure that's even possible (quite possibly not), and anyway, it's beyond the capabilities of everyone except very seasoned library developers. So, do something else.
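For what it's worth, recent CUDA toolkits do ship a device-side std::function-like class, nvstd::function from the <nvfunctional> header. It can only be created and invoked on the same side of the host/device boundary, though, and it cannot be passed as a kernel argument, so it still does not let foo() hand a callable to a kernel. A minimal device-only sketch:

#include <cstdio>
#include <nvfunctional>

__global__ void kernel() {
    // nvstd::function works here because it is both created and
    // invoked in device code; it can never cross the boundary.
    nvstd::function<float()> f = []() { return 1.0f; };
    printf("%f\n", f());
}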

If this is not possible, is there any other way to dynamically select a lambda and pass it to the CUDA kernel? Hard-coding multiple calls to kernel() with all the possible lambdas is not an option.

First, remember that lambdas are essentially anonymous classes, and thus, if they don't capture anything, they're reducible to function pointers, since the anonymous classes have no data, just an operator().
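In plain host C++, that conversion looks like this:

float (*fp)() = [] { return 1.0f; };  // OK only because the lambda captures nothing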

So if the lambdas have the same signature and no capture, you can cast them into a (non-member) function pointer and pass those to the kernel; this definitely works, see this simple example on nVIDIA's forums.
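One caveat: the address of a __device__ function taken in host code is not usable on the device, so the usual trick is to store the pointer in a __device__ variable and fetch it with cudaMemcpyFromSymbol. A minimal sketch of that pattern (function names are illustrative):

#include <cstdio>

typedef float (*FloatFn)();  // common signature of the selectable functions

__device__ float one() { return 1.0f; }
__device__ float two() { return 2.0f; }

// Device-side pointer variables; their values are only valid on the device.
__device__ FloatFn d_one = one;
__device__ FloatFn d_two = two;

__global__ void kernel(FloatFn f) {
    printf("%f\n", f());
}

int main() {
    FloatFn h_fn;
    bool pick_one = true;  // run-time selection happens on the host
    if (pick_one)
        cudaMemcpyFromSymbol(&h_fn, d_one, sizeof(FloatFn));
    else
        cudaMemcpyFromSymbol(&h_fn, d_two, sizeof(FloatFn));
    kernel<<<1, 1>>>(h_fn);
    cudaDeviceSynchronize();
    return 0;
}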

Another possibility is using a run-time mapping from type ids, or other such keys, into instances of these types, or rather, into constructors. That is, using a factory. But I don't want to get into the details of that, so as not to make this answer longer than it already is; and it's probably not a good idea anyway. If you do go down that road, the rough shape is sketched below.
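A very rough sketch of such a factory, mapping run-time keys to host-side launchers (all names are illustrative):

#include <map>
#include <string>

__global__ void kernel_log(float* x) { x[threadIdx.x] = logf(x[threadIdx.x]); }
__global__ void kernel_exp(float* x) { x[threadIdx.x] = expf(x[threadIdx.x]); }

// Each launcher bakes in one compile-time-known kernel;
// only the key lookup is dynamic.
typedef void (*Launcher)(float*, int);
void launch_log(float* d_x, int n) { kernel_log<<<1, n>>>(d_x); }
void launch_exp(float* d_x, int n) { kernel_exp<<<1, n>>>(d_x); }

std::map<std::string, Launcher> make_factory() {
    std::map<std::string, Launcher> m;
    m["log"] = launch_log;
    m["exp"] = launch_exp;
    return m;
}

// usage: make_factory().at("log")(d_x, len);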

While I don't think you can achieve what you want using virtual functions that return device lambdas, you can achieve something similar by passing a class with a static __device__ member function as the template parameter to your kernel. An example is provided below. Note that the classes in this example could also be structs, if you prefer.

#include <cstdlib>   // for rand() and RAND_MAX
#include <iostream>

// Operation: Element-wise logarithm
class OpLog {
    public:
    __device__ static void foo(int tid, float * x) {
        x[tid] = logf(x[tid]);
    }
};

// Operation: Element-wise exponential
class OpExp {
    public:
    __device__ static void foo(int tid, float * x) {
        x[tid] = expf(x[tid]);
    }
};

// Generic kernel
template < class Op >
__global__ void my_kernel(float * x) {
    int tid = threadIdx.x;
    Op::foo(tid,x);
}

// Driver
int main() {

    using namespace std;

    // length of vector
    int len = 10;

    // generate data
    float * h_x = new float[len];
    for(int i = 0; i < len; i++) {
        h_x[i] = rand()/float(RAND_MAX);
    }

    // inspect data
    cout << "h_x = [";
    for(int j = 0; j < len; j++) {
        cout << h_x[j] << " ";
    }
    cout << "]" << endl;

    // copy onto GPU
    float * d_x;
    cudaMalloc(&d_x, len*sizeof(float));
    cudaMemcpy(d_x, h_x, len*sizeof(float), cudaMemcpyHostToDevice);

    // Take the element-wise logarithm
    my_kernel<OpLog><<<1,len>>>(d_x);

    // get result
    cudaMemcpy(h_x, d_x, len*sizeof(float), cudaMemcpyDeviceToHost);
    cout << "h_x = [";
    for(int j = 0; j < len; j++) {
        cout << h_x[j] << " ";
    }
    cout << "]" << endl;

    // Take the element-wise exponential
    my_kernel<OpExp><<<1,len>>>(d_x);

    // get result
    cudaMemcpy(h_x, d_x, len*sizeof(float), cudaMemcpyDeviceToHost);
    cout << "h_x = [";
    for(int j = 0; j < len; j++) {
        cout << h_x[j] << " ";
    }
    cout << "]" << endl;


}
