
printf inside CUDA __global__ function

I am currently writing a matrix multiplication kernel for the GPU and would like to debug my code, but since I cannot use printf inside a device function, is there something else I can do to see what is going on inside that function? This is my current kernel:

__global__ void MatrixMulKernel(Matrix Ad, Matrix Bd, Matrix Xd){

    int tx = threadIdx.x;
    int ty = threadIdx.y;

    // Block indices are read but never used, so this kernel only
    // computes correct results when launched with a single block.
    int bx = blockIdx.x;
    int by = blockIdx.y;

    float sum = 0;

    // Dot product of row ty of Ad with column tx of Bd.
    for( int k = 0; k < Ad.width ; ++k){
        float Melement = Ad.elements[ty * Ad.width + k];
        float Nelement = Bd.elements[k * Bd.width + tx];
        sum += Melement * Nelement;
    }

    Xd.elements[ty * Xd.width + tx] = sum;
}

I would love to know whether Ad and Bd are what I think they are, and to see whether that function is actually being called.

CUDA now supports printf directly in the kernel. For a formal description, see Appendix B.16 of the CUDA C Programming Guide.

EDIT

To avoid misleading people, as M. Tibbits points out, printf is available on any GPU of compute capability 2.0 and higher.

END OF EDIT
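
For example, a minimal sketch of in-kernel printf (assuming a compute capability 2.0+ device and compiling with a matching -arch flag; the kernel name and launch configuration here are illustrative, not from the question):

#include <cstdio>

__global__ void helloKernel() {
    // Each thread prints its coordinates; output is buffered on the device
    // and flushed to the host at synchronization points.
    printf("block (%d,%d), thread (%d,%d)\n",
           blockIdx.x, blockIdx.y, threadIdx.x, threadIdx.y);
}

int main() {
    helloKernel<<<dim3(2, 2), dim3(4, 4)>>>();
    cudaDeviceSynchronize();   // flush the printf buffer before exiting
    return 0;
}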

You have choices:

  • Use a GPU debugger, i.e. cuda-gdb on Linux or Nexus on Windows
  • Use cuprintf, which is available for registered developers (sign up here)
  • Manually copy the data that you want to see, then dump that buffer on the host after your kernel has completed (remember to synchronise); see the sketch below
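
As a sketch of the last option, the hypothetical debugKernel below stands in for your real kernel; each thread writes one value of interest into a dedicated debug buffer, which is then copied back and printed on the host:

#include <cstdio>
#include <vector>

// Hypothetical kernel: each thread records one intermediate value
// into a debug buffer allocated by the host.
__global__ void debugKernel(float *dbg) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    dbg[idx] = static_cast<float>(idx);   // replace with the value you want to inspect
}

int main() {
    const int n = 64;
    float *d_dbg;
    cudaMalloc(&d_dbg, n * sizeof(float));

    debugKernel<<<4, 16>>>(d_dbg);
    cudaDeviceSynchronize();              // wait for the kernel to finish

    std::vector<float> h_dbg(n);
    cudaMemcpy(h_dbg.data(), d_dbg, n * sizeof(float), cudaMemcpyDeviceToHost);
    for (int i = 0; i < n; ++i)
        printf("dbg[%d] = %f\n", i, h_dbg[i]);

    cudaFree(d_dbg);
    return 0;
}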

Regarding your code snippet:

  • Consider passing the Matrix structs in via pointer (i.e. cudaMemcpy them to the device, then pass in the device pointer); right now you will have no problem, but if the function signature gets very large then you may hit the 256-byte limit on kernel arguments
  • You have inefficient reads from Ad: you will issue a 32-byte memory transaction for each read into Melement. Consider using shared memory as a staging area (cf. the transposeNew sample in the SDK); see the sketch after this list
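
To illustrate both points, here is a minimal sketch that passes raw device pointers (copied over with cudaMemcpy beforehand) instead of by-value structs, and stages tiles in shared memory. It assumes square matrices whose width is a multiple of TILE; none of these names come from the original code:

#define TILE 16

__global__ void MatMulTiled(const float *A, const float *B, float *X, int width) {
    __shared__ float As[TILE][TILE];   // staging tiles in shared memory
    __shared__ float Bs[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float sum = 0.0f;

    for (int t = 0; t < width / TILE; ++t) {
        // Coalesced loads: adjacent threads read adjacent addresses,
        // so each tile is fetched with wide memory transactions.
        As[threadIdx.y][threadIdx.x] = A[row * width + t * TILE + threadIdx.x];
        Bs[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * width + col];
        __syncthreads();

        // Partial dot product over this tile, served from shared memory.
        for (int k = 0; k < TILE; ++k)
            sum += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();
    }

    X[row * width + col] = sum;
}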

By the way...

See "Formatted output" (currently B.17) section of CUDA C Programming Guide. 请参阅“CUDA C编程指南”的“格式化输出”(当前为B.17)部分。

http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html
