
Conditional Compilation of CUDA Function

I created a CUDA function for calculating the sum of an image using its histogram.

I'm trying to compile the kernel and the wrapper function for multiple compute capabilities.

Kernel:

__global__ void calc_hist(unsigned char* pSrc, int* hist, int width, int height, int pitch)
{
    int xIndex = blockIdx.x * blockDim.x + threadIdx.x;
    int yIndex = blockIdx.y * blockDim.y + threadIdx.y;

#if __CUDA_ARCH__ > 110   //Shared Memory For Devices Above Compute 1.1
    __shared__ int shared_hist[256];
#endif

    int global_tid = yIndex * pitch + xIndex;

    int block_tid = threadIdx.y * blockDim.x + threadIdx.x;

    if(xIndex>=width || yIndex>=height) return;

#if __CUDA_ARCH__ == 110 //Calculate Histogram In Global Memory For Compute 1.1

    atomicAdd(&hist[pSrc[global_tid]],1);   /*< Atomic Add In Global Memory */

#elif __CUDA_ARCH__ > 110   //Calculate Histogram In Shared Memory For Compute Above 1.1

    shared_hist[block_tid] = 0;   /*< Clear Shared Memory */
    __syncthreads();

    atomicAdd(&shared_hist[pSrc[global_tid]],1);    /*< Atomic Add In Shared Memory */
    __syncthreads();

    if(shared_hist[block_tid] > 0)  /* Only Write Non Zero Bins Into Global Memory */
        atomicAdd(&(hist[block_tid]),shared_hist[block_tid]);
#else 
    return;     //Do Nothing For Devices Of Compute Capability 1.0
#endif
}

Wrapper Function:

int sum_8u_c1(unsigned char* pSrc, double* sum, int width, int height, int pitch, cudaStream_t stream = NULL)
{

#if __CUDA_ARCH__ == 100
    printf("Compute Capability Not Supported\n");
    return 0;

#else
    int *hHist,*dHist;
    cudaMalloc(&dHist,256*sizeof(int));
    cudaHostAlloc(&hHist,256 * sizeof(int),cudaHostAllocDefault);

    cudaMemsetAsync(dHist,0,256 * sizeof(int),stream);

    dim3 Block(16,16);
    dim3 Grid;

    Grid.x = (width + Block.x - 1)/Block.x;
    Grid.y = (height + Block.y - 1)/Block.y;

    calc_hist<<<Grid,Block,0,stream>>>(pSrc,dHist,width,height,pitch);

    cudaMemcpyAsync(hHist,dHist,256 * sizeof(int),cudaMemcpyDeviceToHost,stream);

    cudaStreamSynchronize(stream);

    (*sum) = 0.0;
    for(int i=1; i<256; i++)
        (*sum) += (hHist[i] * i);

    printf("sum = %f\n",(*sum));

    cudaFree(dHist);
    cudaFreeHost(hHist);

    return 1;
#endif

}

Question 1:

When compiling for sm_10, the wrapper and the kernel shouldn't execute. But that is not what happens: the whole wrapper function executes, and the output shows sum = 0.0.

I expected the output to be Compute Capability Not Supported, as I have added the printf statement at the start of the wrapper function.

How can I prevent the wrapper function from executing on sm_10? I don't want to add any run-time checks like if statements etc. Can it be achieved through template metaprogramming?

Question 2:

When compiling for anything greater than sm_10, the program executes correctly only if I add cudaStreamSynchronize after the kernel call. But if I do not synchronize, the output is sum = 0.0. Why is this happening? I want the function to be as asynchronous with respect to the host as possible. Is it possible to move the one remaining loop into the kernel?

I am using a GTX460M, CUDA 5.0, and Visual Studio 2008 on Windows 8.

Ad. Question 1

As Robert already explained in the comments, __CUDA_ARCH__ is defined only when compiling device code. To clarify: when you invoke nvcc, the code is parsed and compiled twice, once for the CPU and once for the GPU. The existence of __CUDA_ARCH__ can be used to check which of those two passes is occurring, and then, within the device code, as you do in the kernel, you can check which GPU you are targeting.
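
To illustrate the two passes, here is a minimal sketch (the function name is just for illustration):

__host__ __device__ void which_pass()
{
#ifdef __CUDA_ARCH__
    /* Device pass: __CUDA_ARCH__ is defined and encodes the target, e.g. 110 for sm_11 */
#else
    /* Host pass: __CUDA_ARCH__ is undefined, so #if __CUDA_ARCH__ == 100 can never be true here */
#endif
}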

However, on the host side not all is lost. While you don't have __CUDA_ARCH__, you can call the API function cudaGetDeviceProperties, which returns lots of information about your GPU. In particular, you may be interested in the fields major and minor, which indicate the Compute Capability. Note that this check happens at run time, not at the preprocessing stage, so the same CPU code will work on all GPUs.
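
For example, a minimal run-time check at the start of the wrapper could look like this (querying device 0 is an assumption here):

cudaDeviceProp prop;
cudaGetDeviceProperties(&prop, 0);    /*< Query the properties of device 0 */

if(prop.major == 1 && prop.minor == 0)    /*< Compute Capability 1.0 */
{
    printf("Compute Capability Not Supported\n");
    return 0;
}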

Ad. Question 2

Kernel calls and cudaMemcpyAsync are asynchronous. This means that if you don't call cudaStreamSynchronize (or similar), the subsequent CPU code will continue running even if your GPU hasn't finished its work. In turn, the data you copy from dHist to hHist might not be there yet when you begin operating on hHist in the loop. If you want to work on the output of a kernel, you have to wait until the kernel finishes.

Note that cudaMemcpy (without Async) performs an implicit synchronization.
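
Applied to your wrapper, the timeline looks like this (a sketch of the existing calls, not new code):

calc_hist<<<Grid,Block,0,stream>>>(pSrc,dHist,width,height,pitch);   /*< Queued, the host does not wait */
cudaMemcpyAsync(hHist,dHist,256 * sizeof(int),cudaMemcpyDeviceToHost,stream);   /*< Queued after the kernel in the same stream */

/* Host code here still races with the GPU - hHist may not be filled yet */

cudaStreamSynchronize(stream);   /*< Blocks until the kernel and the copy have finished */

/* Only from this point on is it safe to read hHist on the host */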
