
Function pointer (to other kernel) as kernel arg in CUDA

With dynamic parallelism in CUDA, you can launch kernels on the GPU side, starting from a certain version. I have a wrapper function that takes a pointer to the kernel I want to use, and it either does this on the CPU for older devices, or on the GPU for newer devices. The CPU fallback path works fine; the GPU path does not, and fails with a complaint that the memory alignment is incorrect.

Is there a way to do this in CUDA (7)? Are there some lower-level calls that will give me a pointer address that's correct on the GPU?

The code is below; the template "TFunc" is an attempt to get the compiler to do something different, but I've tried it strongly typed as well.

template <typename TFunc, typename... TArgs>
__global__ void Test(TFunc func, int count, TArgs... args)
{
#if defined(__CUDA_ARCH__) && (__CUDA_ARCH__ >= 320)
    (*func)<<<1, 1>>>(args...);
#else
    printf("What are you doing here!?\n");
#endif
}

template <typename... TArgs>
__host__ void Iterate(void(*kernel)(TArgs...), const systemInfo *sysInfo, int count, TArgs... args)
{
    if(sysInfo->getCurrentDevice()->compareVersion("3.2") > 0)
    {
        printf("Iterate on GPU\n");
        Test<<<1, 1>>>(kernel, count, args...);
    }
    else
    {
        printf("Iterate on CPU\n");
        Test<<<1, 1>>>(kernel, count, args...);
    }
}

EDIT: At the time I originally wrote this answer, I believe the statements were correct: it was not possible to take a kernel address in host code. However, I believe something has changed in CUDA since then, so now (in CUDA 8, and maybe prior) it is possible to take a kernel address in host code (it's still not possible to take the address of a __device__ function in host code, however).
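
A minimal sketch of what that enables, assuming the CUDA 8 behavior just described (the kernel and variable names here are only illustrative, and this path still requires compiling with -rdc=true -lcudadevrt for an sm_35 or newer target):

#include <stdio.h>

__global__ void child(){

  printf("hello from child\n");
}

__global__ void launcher(void (*f)()){

#if defined(__CUDA_ARCH__) && (__CUDA_ARCH__ >= 350)
  f<<<1, 1>>>();  // device-side launch through the passed pointer
#endif
}

int main(){

  void (*h_child)() = child;  // taking the kernel address in host code, per the note above
  launcher<<<1, 1>>>(h_child);
  cudaDeviceSynchronize();
  return 0;
}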

ORIGINAL ANSWER:

It seems like this question comes up from time to time, although the previous examples I can think of have to do with calling __device__ functions instead of __global__ functions.

In general, it's illegal to take the address of a device entity (variable, function) in host code.
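
For a __device__ variable, for example, the sanctioned route is a runtime API call such as cudaGetSymbolAddress() rather than the & operator. A small sketch (d_val is just an illustrative name):

__device__ int d_val;

int main(){

  int *addr = NULL;
  // int *addr = &d_val;    // illegal: host code taking a device address directly
  cudaGetSymbolAddress((void **)&addr, d_val);  // ask the runtime for the device address
  // addr now holds a device address, usable e.g. as a kernel argument
  return 0;
}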

One possible method to work around this (although the utility of this is not clear to me; it seems like there would be simpler dispatch mechanisms) is to extract the needed device address, in device code, and return that value to the host for dispatch usage. In this case, I am creating a simple example that extracts the needed device addresses into __device__ variables, but you could also write a kernel to do this setup (i.e. to "give me a pointer address that's correct on the GPU", in your words).

Here's a rough worked example, building on the code you have shown:

$ cat t746.cu
#include <stdio.h>

__global__ void ckernel1(){

  printf("hello1\n");
}
__global__ void ckernel2(){

  printf("hello2\n");
}
__global__ void ckernel3(){

  printf("hello3\n");
}

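// device-side function pointers, statically initialized with the kernel addresses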
__device__ void (*pck1)() = ckernel1;
__device__ void (*pck2)() = ckernel2;
__device__ void (*pck3)() = ckernel3;

template <typename TFunc, typename... TArgs>
__global__ void Test(TFunc func, int count, TArgs... args)
{
#if defined(__CUDA_ARCH__) && (__CUDA_ARCH__ >= 350)
    (*func)<<<1, 1>>>(args...);
#else
    printf("What are you doing here!?\n");
#endif
}

template <typename... TArgs>
__host__ void Iterate(void(*kernel)(TArgs...), const int sysInfo, int count, TArgs... args)
{
    if(sysInfo >= 350)
    {
        printf("Iterate on GPU\n");
        Test<<<1, 1>>>(kernel, count, args...);
    }
    else
    {
        printf("Iterate on CPU\n");
        Test<<<1, 1>>>(kernel, count, args...);
    }
}


int main(){

  void (*h_ckernel1)();
  void (*h_ckernel2)();
  void (*h_ckernel3)();
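  // copy the kernel addresses held in the __device__ pointer variables back to the host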
  cudaMemcpyFromSymbol(&h_ckernel1, pck1, sizeof(void *));
  cudaMemcpyFromSymbol(&h_ckernel2, pck2, sizeof(void *));
  cudaMemcpyFromSymbol(&h_ckernel3, pck3, sizeof(void *));
  Iterate(h_ckernel1, 350, 1);
  Iterate(h_ckernel2, 350, 1);
  Iterate(h_ckernel3, 350, 1);
  cudaDeviceSynchronize();
  return 0;
}

$ nvcc -std=c++11 -arch=sm_35 -o t746 t746.cu -rdc=true -lcudadevrt
$ cuda-memcheck ./t746
========= CUDA-MEMCHECK
Iterate on GPU
Iterate on GPU
Iterate on GPU
hello1
hello2
hello3
========= ERROR SUMMARY: 0 errors
$

The above (__device__ variable) method probably can't be made to work with templated child kernels, but it might be possible to create a templated "extractor" kernel that returns the address of an (instantiated) templated child kernel. A rough idea of the "extractor" setup_kernel method is given in the previous answer I linked. Here's a rough example of the templated child kernel/extractor kernel method:

$ cat t746.cu
#include <stdio.h>

template <typename T>
__global__ void ckernel1(T *data){

  int my_val = (int)(*data+1);
  printf("hello: %d \n", my_val);
}
template <typename TFunc, typename... TArgs>
__global__ void Test(TFunc func, int count, TArgs... args)
{
#if defined(__CUDA_ARCH__) && (__CUDA_ARCH__ >= 350)
    (*func)<<<1, 1>>>(args...);
#else
    printf("What are you doing here!?\n");
#endif
}

template <typename... TArgs>
__host__ void Iterate(void(*kernel)(TArgs...), const int sysInfo, int count, TArgs... args)
{
    if(sysInfo >= 350)
    {
        printf("Iterate on GPU\n");
        Test<<<1, 1>>>(kernel, count, args...);
    }
    else
    {
        printf("Iterate on CPU\n");
        Test<<<1, 1>>>(kernel, count, args...);
    }
}

template <typename T>
__global__ void extractor(void (**kernel)(T *)){

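  // store the device-side address of the instantiated child kernel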
  *kernel = ckernel1<T>;
}

template <typename T>
void run_test(T init){

  void (*h_ckernel1)(T *);
  void (**d_ckernel1)(T *);
  T *d_data;
  cudaMalloc(&d_ckernel1, sizeof(void *));
  cudaMalloc(&d_data, sizeof(T));
  cudaMemcpy(d_data, &init, sizeof(T), cudaMemcpyHostToDevice);
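  // run the extractor, then copy the recovered kernel address back to the host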
  extractor<<<1,1>>>(d_ckernel1);
  cudaMemcpy((void *)&h_ckernel1, (void *)d_ckernel1, sizeof(void *), cudaMemcpyDeviceToHost);
  Iterate(h_ckernel1, 350, 1, d_data);
  cudaDeviceSynchronize();
  cudaFree(d_ckernel1);
  cudaFree(d_data);
  return;
}

int main(){

  run_test(1);
  run_test(2.0f);

  return 0;
}

$ nvcc -std=c++11 -arch=sm_35 -o t746 t746.cu -rdc=true -lcudadevrt
$ cuda-memcheck ./t746
========= CUDA-MEMCHECK
Iterate on GPU
hello: 2
Iterate on GPU
hello: 3
========= ERROR SUMMARY: 0 errors
$
