
CUDA: Is it possible to treat one core as "master" to do memory malloc, and run other "logic code"?

I'm porting a C++ program to CUDA; the calculations are all about matrices/vectors. The first function I ported is the matrix FFT. After porting it to CUDA, I found that data transfer between CPU and GPU takes almost all the time.

// interface: do shift and inverse FFT on a matrix
extern "C" int cu_inv_fft_shift(std::complex<double>* ptrDest, int nRows, int nCols) {

    #ifdef ENABLE_DEBUG_TIME_MEASURE
    float ms1 = 0.f, ms2 = 0.f, ms3 = 0.f, ms4 = 0.f;
    cudaEvent_t startEvent, stopEvent;
    cudaEventCreate(&startEvent); cudaEventCreate(&stopEvent);
    #endif

    // step1: cpu -> gpu, and column-major -> row-major
    #ifdef ENABLE_DEBUG_TIME_MEASURE
    cudaEventRecord(startEvent, 0);
    #endif

    cufftDoubleComplex* ptr_data = matrix_to_cu_data(ptrDest, nRows, nCols);

    #ifdef ENABLE_DEBUG_TIME_MEASURE
    cudaEventRecord(stopEvent, 0); 
    cudaEventSynchronize(startEvent);cudaEventSynchronize(stopEvent);
    cudaEventElapsedTime(&ms1, startEvent, stopEvent);
    #endif

    // step2: do shift on gpu buffer
    #ifdef ENABLE_DEBUG_TIME_MEASURE
    cudaEventRecord(startEvent, 0);
    #endif

    ptr_data = fft_shift_cd(ptr_data, nRows, nCols);

    #ifdef ENABLE_DEBUG_TIME_MEASURE
    cudaEventRecord(stopEvent, 0); 
    cudaEventSynchronize(startEvent);cudaEventSynchronize(stopEvent);
    cudaEventElapsedTime(&ms2, startEvent, stopEvent);
    #endif

    // step3: do FFT on gpu buffer
    #ifdef ENABLE_DEBUG_TIME_MEASURE
    cudaEventRecord(startEvent, 0);
    #endif

    ptr_data = do_fft_cd(ptr_data, nRows, nCols, CUFFT_INVERSE);

    #ifdef ENABLE_DEBUG_TIME_MEASURE
    cudaEventRecord(stopEvent, 0); 
    cudaEventSynchronize(startEvent);cudaEventSynchronize(stopEvent);
    cudaEventElapsedTime(&ms3, startEvent, stopEvent);
    #endif

    // step4: row-major -> column-major, and gpu -> cpu
    #ifdef ENABLE_DEBUG_TIME_MEASURE
    cudaEventRecord(startEvent, 0);
    #endif

    ptr_data = cu_data_to_matrix_inv(ptrDest, nRows, nCols, ptr_data);

    #ifdef ENABLE_DEBUG_TIME_MEASURE
    cudaEventRecord(stopEvent, 0); 
    cudaEventSynchronize(startEvent);cudaEventSynchronize(stopEvent);
    cudaEventElapsedTime(&ms4, startEvent, stopEvent);
    #endif

    #ifdef ENABLE_DEBUG_TIME_MEASURE
    cudaEventDestroy(startEvent); cudaEventDestroy(stopEvent);
    //std::cout << __func__ << " called.."<< std::endl;
    printf("%s: %.4fms, %.4fms, %.4fms, %.4fms\n", __func__, ms1, ms2, ms3, ms4);
    #endif

    cudaFree(ptr_data);
    return 0;
}

The measured result when the matrix is 8192x8192:

cu_fwd_fft_shift: 4.2841ms, 0.7394ms, 0.0492ms, 4.2857ms

It means that (verified):

  • CPU->GPU: 4.2 ms.
  • Forward FFT: 0.7 ms.
  • FFT shift: 0.05 ms.
  • GPU->CPU: 4.2 ms.

The problem I encountered is that in a CPU function there are some "code snippets" (just like the FFT) that could be ported to CUDA, but there is some if/else code, and intermediate memory mallocs, between them.

I want to reduce the CPU<-->GPU data transfer. My idea is to port the whole CPU function to CUDA (the GPU side), but it contains a lot of "logic code" such as if/else branches and intermediate memory mallocs.
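A common pattern that avoids porting the logic code at all (a sketch, not from the original post) is to keep the data resident on the device for the whole pipeline and let the host run the if/else, which only decides which kernel to launch next. `fft_shift_cd` and `do_fft_cd` are the helpers from the question; `process_a` and `process_b` are hypothetical placeholders for the branch-dependent steps:

```cuda
// Hypothetical sketch: the matrix stays on the device across all steps;
// the host runs the "logic code" and only launches kernels.
void run_pipeline(cufftDoubleComplex* d_data, int nRows, int nCols,
                  bool useVariantA) {
    dim3 block(256);
    dim3 grid((nRows * nCols + block.x - 1) / block.x);

    d_data = fft_shift_cd(d_data, nRows, nCols);              // from the question
    d_data = do_fft_cd(d_data, nRows, nCols, CUFFT_INVERSE);  // from the question

    // Host-side branching costs essentially nothing as long as it does not
    // force a device->host copy: it merely selects the next kernel.
    if (useVariantA)
        process_a<<<grid, block>>>(d_data, nRows, nCols);     // hypothetical
    else
        process_b<<<grid, block>>>(d_data, nRows, nCols);     // hypothetical

    // Intermediate buffers can be cudaMalloc'd once up front and reused,
    // so no per-step allocation or host round trip is needed.
    // Copy the result back to the host only once, at the very end.
}
```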

So my questions are:

  1. Is it possible to set one core as a master (just like a CPU) to process these mallocs / "logic code" and dispatch the subsequent calculations to all other cores?
  2. Are there any other CUDA projects I can study? Or
  3. Is this solution impossible?
  1. Is it possible to set one core as a master (just like a CPU) to process these mallocs / "logic code" and dispatch the subsequent calculations to all other cores?

CUDA doesn't expose that level of granularity in its execution model, so that isn't possible. There is dynamic parallelism, which allows one kernel to dispatch other kernels and offers a very minimal subset of the CUDA runtime API. You might be able to adapt that paradigm to your application.
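As a minimal illustration of dynamic parallelism (a generic sketch, not from the answer; requires compute capability 3.5+ and compiling with `nvcc -rdc=true`), a "parent" kernel can make a decision on the device and launch a "child" grid without a round trip to the host:

```cuda
// Minimal dynamic-parallelism sketch (sm_35+, compiled with -rdc=true).
__global__ void child(double* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0;
}

__global__ void parent(double* data, int n, bool doubleIt) {
    // One thread acts as the "master": it branches and launches a child
    // grid. Note that only a small subset of the CUDA runtime API is
    // available in device code (kernel launches, streams, events,
    // cudaMalloc/cudaFree), not the full host-side API.
    if (threadIdx.x == 0 && doubleIt) {
        child<<<(n + 255) / 256, 256>>>(data, n);
    }
}
```

Even so, device-side launches carry their own overhead, so this is a structural convenience rather than a way to make a serial algorithm fast.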

  2. Are there any other CUDA projects I can study? Or

If you search and read the various material NVIDIA has made available on dynamic parallelism, you might find something you can learn from and assess whether it would work for your use case.

  3. Is this solution impossible?

Probably, yes.

In general, when you start a GPU programming question or proposition with "I'm porting a C++ program to CUDA", and you mean porting in the most literal sense, you are usually doing something wrong. It is exceedingly rare that a conventional codebase or serial algorithm can be blindly "ported" and be either correct, fast, or both. The GPU programming paradigm is rather different from conventional single- and multi-threaded CPU coding, and if you try to treat the GPU like a CPU, you will fail.
