
HELP! CUDA kernel will no longer launch after using too much memory

I'm writing a program that requires the following kernel launch:

dim3 blocks(16,16,16); //grid dimensions
dim3 threads(32,32); //block dimensions
get_gaussian_responses<<<blocks,threads>>>(pDeviceIntegral,itgStepSize,pScaleSpace);

I forgot to free the pScaleSpace array at the end of the program, and then ran the program through the CUDA profiler, which runs it 15 times in succession, using up a lot of memory / causing a lot of fragmentation. Now whenever I run the program, the kernel doesn't even launch. If I look at the list of function calls recorded by the profiler, the kernel is not there. I realize this is a pretty stupid error, but I don't know what I can do at this point to get the program to run again. I have restarted my computer, but that did not help. If I reduce the dimensions of the kernel, it runs fine, but the current dimensions are well within the allowed maximum for my card.
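One way to see *why* a launch silently fails (a hypothetical diagnostic sketch, not part of the original question; the kernel body and allocation size here are stand-ins) is to check the runtime's error status immediately after the launch and again after synchronising:

```cuda
// Hypothetical diagnostic sketch: a launch that fails its configuration
// check (too many threads, out of memory, etc.) reports the reason via
// cudaGetLastError(); errors raised while the kernel runs only surface
// after synchronisation.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void get_gaussian_responses(float *out) { /* body omitted */ }

int main()
{
    float *pScaleSpace = 0;
    cudaMalloc((void **)&pScaleSpace, 1024 * sizeof(float)); // size is illustrative

    dim3 blocks(16, 16, 16); // grid dimensions
    dim3 threads(32, 32);    // block dimensions (32*32 = 1024 threads)
    get_gaussian_responses<<<blocks, threads>>>(pScaleSpace);

    // Configuration errors show up immediately after the launch
    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess)
        printf("launch failed: %s\n", cudaGetErrorString(err));

    // Execution errors show up only after the kernel has run
    err = cudaDeviceSynchronize();
    if (err != cudaSuccess)
        printf("kernel failed: %s\n", cudaGetErrorString(err));

    cudaFree(pScaleSpace);
    return 0;
}
```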

Max threads per block: 1024
Max grid dimensions: 65535,65535,65535

Any suggestions appreciated, thanks in advance!

Try launching with a smaller number of threads. If that works, it means that each of your threads is doing a lot of work or using a lot of memory, so the maximum possible number of threads cannot practically be launched by CUDA on your hardware.

You may have to make your CUDA code more efficient to be able to launch more threads. You could try slicing your kernel into smaller pieces if it has complex logic inside it. Or get more powerful hardware.
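One illustrative way to "slice" such a launch (a sketch only, not the asker's code; the slab-based kernel and helper are hypothetical) is to run the grid one z-slab at a time, passing the slab index in so each launch stays small:

```cuda
// Illustrative sketch: run the original 16x16x16 grid as 16 separate
// 16x16 launches, one z-slab per launch. The kernel reconstructs its
// global z coordinate from the passed-in offset instead of blockIdx.z.
__global__ void get_gaussian_responses_slab(float *out, int zSlab)
{
    int z = zSlab; // which slab of the original grid this launch covers
    // ... same per-thread work as before, using z in place of blockIdx.z ...
    (void)out; (void)z;
}

void launch_in_slabs(float *out)
{
    dim3 blocks(16, 16);  // one z-slab of the original 16x16x16 grid
    dim3 threads(32, 32);
    for (int zSlab = 0; zSlab < 16; ++zSlab)
        get_gaussian_responses_slab<<<blocks, threads>>>(out, zSlab);
}
```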

If you compile your code like this:

nvcc -Xptxas="-v" [other compiler options]

the assembler will report the amount of local memory that the code requires. This can be a useful diagnostic to see what the memory footprint of the kernel is. There is also an API call cudaThreadSetLimit which can be used to control the amount of per-thread heap memory which a kernel will try to consume during execution.
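As a minimal sketch of that API call (using the CUDA 4.x-era cudaThreadSetLimit the answer names; the 8 MB figure is an arbitrary illustration), the heap limit is set before the first kernel launch and can be read back:

```cuda
// Sketch: cap the per-context device malloc heap before any kernel runs.
// The 8 MB value is arbitrary, chosen purely for illustration.
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaThreadSetLimit(cudaLimitMallocHeapSize, 8 * 1024 * 1024);

    size_t heapBytes = 0;
    cudaThreadGetLimit(&heapBytes, cudaLimitMallocHeapSize);
    printf("malloc heap limit: %zu bytes\n", heapBytes);
    return 0;
}
```

In current toolkits the same calls are spelled cudaDeviceSetLimit / cudaDeviceGetLimit; the cudaThread* forms are the ones that existed at the time of this question.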

Recent toolkits ship with a utility called cuda-memcheck, which provides valgrind-like analysis of kernel memory access, including buffer overflows and illegal memory usage. It might be that your code is overflowing some memory somewhere and overwriting other parts of GPU memory, leaving the card in a parlous state.

I got it. nVidia NSight 2.0 - which supposedly supports CUDA 4 - changed my CUDA_INC_PATH to use CUDA 3.2. No wonder it wouldn't let me allocate 1024 threads per block. All relief and jubilation aside, that is a really stupid and annoying bug considering I already had CUDA 4.0 RC2 installed.
