
CUDA kernel doesn't launch

My problem is very much like this one. I run the simplest CUDA program but the kernel doesn't launch. However, I am sure that my CUDA installation is ok, since I can run complicated CUDA projects consisting of several files (which I took from someone else) with no problems. In those projects, compilation and linking are done through makefiles with a lot of flags. I think the problem is finding the correct flags to use while compiling. I simply use a command like this: nvcc -arch=sm_20 -lcudart test.cu with a program like the following (to run on a Linux machine):

#include "cuPrintf.cu"   // cuPrintf helpers from the CUDA SDK samples

__global__ void myKernel() 
{ 
    cuPrintf("Hello, world from the device!\n"); 
} 

int main() 
{ 
    cudaPrintfInit(); 
    myKernel<<<1,10>>>(); 
    cudaPrintfDisplay(stdout, true); 
    cudaPrintfEnd(); 
} 

The program compiles correctly. When I add cudaMemcpy() operations, they return no error. Any suggestion on why the kernel doesn't launch?

The reason it is not printing when using printf is that kernel launches are asynchronous and your program is exiting before the printf buffer gets flushed. Section B.16 of the CUDA (5.0) C Programming Guide explains this:

The output buffer for printf() is set to a fixed size before kernel launch (see Associated Host-Side API). It is circular and if more output is produced during kernel execution than can fit in the buffer, older output is overwritten. It is flushed only when one of these actions is performed:

  • Kernel launch via <<<>>> or cuLaunchKernel() (at the start of the launch, and if the CUDA_LAUNCH_BLOCKING environment variable is set to 1, at the end of the launch as well),
  • Synchronization via cudaDeviceSynchronize(), cuCtxSynchronize(), cudaStreamSynchronize(), cuStreamSynchronize(), cudaEventSynchronize(), or cuEventSynchronize(),
  • Memory copies via any blocking version of cudaMemcpy*() or cuMemcpy*(),
  • Module loading/unloading via cuModuleLoad() or cuModuleUnload(),
  • Context destruction via cudaDeviceReset() or cuCtxDestroy().

For this reason, this program prints nothing:

#include <stdio.h>

__global__ void myKernel() 
{ 
  printf("Hello, world from the device!\n"); 
} 

int main() 
{ 
  myKernel<<<1,10>>>(); 
} 

But this program prints "Hello, world from the device!" ten times:

#include <stdio.h>

__global__ void myKernel() 
{ 
  printf("Hello, world from the device!\n"); 
} 

int main() 
{ 
  myKernel<<<1,10>>>(); 
  cudaDeviceSynchronize();
} 
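
As noted in the list above, a blocking cudaMemcpy() also flushes the buffer, which may be why you saw no error when you added copies. A minimal sketch (the device allocation exists only to have something to copy):

#include <stdio.h>

__global__ void myKernel() 
{ 
  printf("Hello, world from the device!\n"); 
} 

int main() 
{ 
  int host_value = 0;
  int *device_value;
  cudaMalloc((void**)&device_value, sizeof(int));
  myKernel<<<1,10>>>(); 
  // The blocking copy waits for the kernel to finish and flushes the printf buffer.
  cudaMemcpy(&host_value, device_value, sizeof(int), cudaMemcpyDeviceToHost);
  cudaFree(device_value);
} 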

Are you sure that your CUDA device supports the SM_20 architecture?

Remove the arch= option from your nvcc command line and rebuild everything. This compiles for the 1.0 CUDA architecture, which is supported on all CUDA devices. If it still doesn't run, do a clean build and make sure there are no stale object files left anywhere. Then rebuild and run.
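
One way to see whether the launch itself is failing is to check cudaGetLastError() right after the launch; an architecture mismatch typically reports "invalid device function". A minimal sketch:

#include <stdio.h>

__global__ void myKernel() 
{ 
  printf("Hello, world from the device!\n"); 
} 

int main() 
{ 
  myKernel<<<1,10>>>(); 
  // An architecture mismatch (e.g. an sm_20 binary on a device that cannot run it)
  // typically shows up here as "invalid device function".
  cudaError_t err = cudaGetLastError();
  if (err != cudaSuccess)
    printf("Kernel launch failed: %s\n", cudaGetErrorString(err));
  cudaDeviceSynchronize();
} 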

Also, arch= refers to the virtual architecture, which should be something like compute_10. sm_20 is the real architecture and I believe should be used with the code= switch, not arch=.
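
For example, a command along these lines makes the virtual/real pair explicit:

nvcc -gencode arch=compute_20,code=sm_20 test.cu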

In Visual Studio:

Right click on your project > Properties > CUDA C/C++ > Device

and add the following to the Code Generation field:

compute_30,sm_30;compute_35,sm_35;compute_37,sm_37;compute_50,sm_50;compute_52,sm_52;compute_60,sm_60;compute_61,sm_61;compute_70,sm_70;compute_75,sm_75;

Generating code for all these architectures makes your build a bit slower. So eliminate them one by one to find which compute and sm gencode values are required for your GPU. But if you are shipping this to others, it is better to include all of them.
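
If you are unsure which pair your GPU needs, a small sketch like this (built with nvcc) reports the device's compute capability, which maps directly to the compute_XY/sm_XY values:

#include <stdio.h>

int main() 
{ 
  cudaDeviceProp prop;
  cudaGetDeviceProperties(&prop, 0);  // query device 0
  // major.minor maps to the gencode values, e.g. 7.5 -> compute_75,sm_75
  printf("Compute capability: %d.%d\n", prop.major, prop.minor);
} 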
