
printf inside CUDA __global__ function

I am currently writing a matrix multiplication kernel for the GPU and would like to debug my code, but since I cannot use printf inside a device function, is there something else I can do to see what is going on inside that function? This is my current kernel:

__global__ void MatrixMulKernel(Matrix Ad, Matrix Bd, Matrix Xd){

    int tx = threadIdx.x;
    int ty = threadIdx.y;

    // Block indices are read but never used, so this kernel only
    // computes correct results when launched with a single block.
    int bx = blockIdx.x;
    int by = blockIdx.y;

    float sum = 0;

    // Dot product of row ty of Ad with column tx of Bd.
    for( int k = 0; k < Ad.width ; ++k){
        float Melement = Ad.elements[ty * Ad.width + k];
        float Nelement = Bd.elements[k * Bd.width + tx];
        sum += Melement * Nelement;
    }

    Xd.elements[ty * Xd.width + tx] = sum;
}

I would love to know whether Ad and Bd are what I think they are, and to see whether that function is actually being called.

CUDA now supports printf directly in the kernel. For a formal description, see Appendix B.16 of the CUDA C Programming Guide.

EDIT

To avoid misleading people, as M. Tibbits points out, printf is available on any GPU of compute capability 2.0 and higher.

END OF EDIT
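
For example, a minimal sketch of in-kernel printf (assuming a compute capability 2.0+ device and compiling with a matching -arch flag; the kernel name and launch configuration here are illustrative, not from the question):

#include <cstdio>

__global__ void helloKernel() {
    // Each thread prints its coordinates; output is buffered on the device
    // and flushed to the host at synchronization points.
    printf("block (%d,%d), thread (%d,%d)\n",
           blockIdx.x, blockIdx.y, threadIdx.x, threadIdx.y);
}

int main() {
    helloKernel<<<dim3(2, 2), dim3(4, 4)>>>();
    cudaDeviceSynchronize();   // flush the printf buffer before exiting
    return 0;
}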

You have choices:

  • Use a GPU debugger, i.e. cuda-gdb on Linux or Nexus on Windows
  • Use cuprintf, which is available for registered developers (sign up here)
  • Manually copy the data that you want to see, then dump that buffer on the host after your kernel has completed (remember to synchronise); see the sketch below
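
As a sketch of the last option, the hypothetical debugKernel below stands in for your real kernel; each thread writes one value of interest into a dedicated debug buffer, which is then copied back and printed on the host:

#include <cstdio>
#include <vector>

// Hypothetical kernel: each thread records one intermediate value
// into a debug buffer allocated by the host.
__global__ void debugKernel(float *dbg) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    dbg[idx] = static_cast<float>(idx);   // replace with the value you want to inspect
}

int main() {
    const int n = 64;
    float *d_dbg;
    cudaMalloc(&d_dbg, n * sizeof(float));

    debugKernel<<<4, 16>>>(d_dbg);
    cudaDeviceSynchronize();              // wait for the kernel to finish

    std::vector<float> h_dbg(n);
    cudaMemcpy(h_dbg.data(), d_dbg, n * sizeof(float), cudaMemcpyDeviceToHost);
    for (int i = 0; i < n; ++i)
        printf("dbg[%d] = %f\n", i, h_dbg[i]);

    cudaFree(d_dbg);
    return 0;
}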

Regarding your code snippet:

  • Consider passing the Matrix structs in via pointer (i.e. cudaMemcpy them to the device, then pass in the device pointer); right now you will have no problem, but if the function signature gets very large then you may hit the 256-byte limit on kernel arguments
  • You have inefficient reads from Ad: you will issue a 32-byte memory transaction for each read into Melement. Consider using shared memory as a staging area (cf. the transposeNew sample in the SDK); see the sketch after this list
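
To illustrate both points, here is a minimal sketch that passes raw device pointers (copied over with cudaMemcpy beforehand) instead of by-value structs, and stages tiles in shared memory. It assumes square matrices whose width is a multiple of TILE; none of these names come from the original code:

#define TILE 16

__global__ void MatMulTiled(const float *A, const float *B, float *X, int width) {
    __shared__ float As[TILE][TILE];   // staging tiles in shared memory
    __shared__ float Bs[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float sum = 0.0f;

    for (int t = 0; t < width / TILE; ++t) {
        // Coalesced loads: adjacent threads read adjacent addresses,
        // so each tile is fetched with wide memory transactions.
        As[threadIdx.y][threadIdx.x] = A[row * width + t * TILE + threadIdx.x];
        Bs[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * width + col];
        __syncthreads();

        // Partial dot product over this tile, served from shared memory.
        for (int k = 0; k < TILE; ++k)
            sum += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();
    }

    X[row * width + col] = sum;
}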

By the way...

See "Formatted output" (currently B.17) section of CUDA C Programming Guide. 请参阅“CUDA C编程指南”的“格式化输出”(当前为B.17)部分。

http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html
