
What makes cuLaunchKernel fail with CUDA_ERROR_INVALID_HANDLE?

I'm launching a CUDA kernel I've compiled, using the cuLaunchKernel() driver API function. I'm passing my parameters in a kernelParams array, and passing nullptr for the extra argument.

Unfortunately, this fails with the error CUDA_ERROR_INVALID_HANDLE. Why? I checked the Driver API documentation to see in which cases the function might fail, and it discusses failure with CUDA_ERROR_INVALID_VALUE (not the same thing). It doesn't discuss the error I get.

Since more than one parameter to cuLaunchKernel() is some sort of handle, what does this failure mean? (And if there are multiple options, what are they?)

One possibility is a failure due to a CUDA driver context switch. You may have inadvertently performed some action which pushes or replaces the current context for the CUDA device. Loaded modules belong to the context they were loaded in, so a kernel from your compiled and loaded module can no longer be launched once a different context is current. This triggers a CUDA_ERROR_INVALID_HANDLE failure.
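To check whether this is what is happening, you could compare the thread's current context with the one your module was loaded in, right before the launch. The following is a minimal sketch, not a definitive diagnostic: `context_is_current` is a hypothetical helper name, and the `main` function uses device 0 and its primary context purely as an assumption for demonstration.

```cpp
#include <cuda.h>
#include <cstdio>

// Hypothetical helper: reports whether `expected` (the context the module was
// loaded in) is the current context of the calling thread.
bool context_is_current(CUcontext expected) {
    CUcontext current = nullptr;
    CUresult status = cuCtxGetCurrent(&current);
    if (status != CUDA_SUCCESS) {
        const char* name = nullptr;
        cuGetErrorName(status, &name);
        std::fprintf(stderr, "cuCtxGetCurrent failed: %s\n",
                     name ? name : "unknown error");
        return false;
    }
    if (current != expected) {
        std::fprintf(stderr, "context mismatch: current=%p expected=%p\n",
                     static_cast<void*>(current), static_cast<void*>(expected));
        return false;
    }
    return true;
}

int main() {
    CUdevice device;
    CUcontext ctx = nullptr;
    // Assumption: device 0 and its primary context, for demonstration only.
    if (cuInit(0) != CUDA_SUCCESS ||
        cuDeviceGet(&device, 0) != CUDA_SUCCESS ||
        cuDevicePrimaryCtxRetain(&ctx, device) != CUDA_SUCCESS ||
        cuCtxSetCurrent(ctx) != CUDA_SUCCESS) {
        std::fprintf(stderr, "CUDA setup failed\n");
        return 1;
    }
    std::printf("before unbinding: %s\n",
                context_is_current(ctx) ? "match" : "mismatch");
    cuCtxSetCurrent(nullptr);  // stand-in for other code replacing the context
    std::printf("after unbinding: %s\n",
                context_is_current(ctx) ? "match" : "mismatch");
    cuDevicePrimaryCtxRelease(device);
    return 0;
}
```

This needs a CUDA-capable device and linking against the driver library (typically `-lcuda`). In your own code, you would call the helper with the context in which you called cuModuleLoad or its variants.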

Assuming this is the case, switch the context before the launch, e.g. this way:

cuCtxPushCurrent(my_driver_context);
cuLaunchKernel(/*etc. etc. */);
/* possibly */ cuCtxPopCurrent(NULL);

or like so:

cuCtxSetCurrent(my_driver_context);
cuLaunchKernel(/*etc. etc. */);

Note that you may be risking memory leaks if you pop and ignore the only reference to a valid context. You also risk breaking other code which assumes that the context it put in place is still the current one.
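One way to reduce the second risk is to pair every push with a pop in a scope guard, so the previous context is restored even when the launch code returns early or throws. The sketch below is deliberately generic so that it compiles and runs without a GPU: `ScopedContext`, `fake_stack` and `run_in_context` are hypothetical names, and the integer stack is only a stand-in for the per-thread context stack. With the real driver API, the push callable would wrap cuCtxPushCurrent and the pop callable would wrap cuCtxPopCurrent.

```cpp
#include <cassert>
#include <functional>
#include <stdexcept>
#include <utility>
#include <vector>

// Hypothetical scope guard (not part of CUDA): calls `push` on construction
// and `pop` on destruction, so the two are always paired.
class ScopedContext {
public:
    ScopedContext(const std::function<bool()>& push, std::function<void()> pop)
        : pop_(std::move(pop)) {
        // If push fails we throw; the destructor then never runs, so pop is
        // not called for a push that did not happen.
        if (!push()) throw std::runtime_error("context push failed");
    }
    ~ScopedContext() { pop_(); }
    ScopedContext(const ScopedContext&) = delete;
    ScopedContext& operator=(const ScopedContext&) = delete;

private:
    std::function<void()> pop_;
};

// Stand-in for the per-thread context stack, so the sketch runs without a GPU.
// Assumption: real code would call cuCtxPushCurrent / cuCtxPopCurrent instead.
std::vector<int> fake_stack;

// Runs `work` with `ctx` on top of the fake stack and returns the context
// that is current right after `work` finishes, before the guard pops it.
int run_in_context(int ctx, const std::function<void()>& work) {
    ScopedContext guard(
        [ctx] { fake_stack.push_back(ctx); return true; },
        [] { fake_stack.pop_back(); });
    work();  // the kernel launch would go here
    return fake_stack.back();
}
```

The guard only addresses restoring the previous context. It does not manage the lifetime of the context itself, so the memory leak concern still has to be handled separately.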

Well, in my case it was an out-of-memory (OOM) error which for some reason was not reported as such. When I reduced the batch size of my model, it worked. Maybe you should check whether this is the case for you as well.
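If you want to check this without changing the batch size, one option is to query the free device memory shortly before the launch. This is a rough sketch under assumptions: it uses device 0 and its primary context, and a low free-memory figure is only a hint, not proof that the launch failure is memory-related.

```cpp
#include <cuda.h>
#include <cstddef>
#include <cstdio>

int main() {
    CUdevice device;
    CUcontext ctx = nullptr;
    // Assumption: device 0 and its primary context, for demonstration only.
    if (cuInit(0) != CUDA_SUCCESS ||
        cuDeviceGet(&device, 0) != CUDA_SUCCESS ||
        cuDevicePrimaryCtxRetain(&ctx, device) != CUDA_SUCCESS ||
        cuCtxSetCurrent(ctx) != CUDA_SUCCESS) {
        std::fprintf(stderr, "CUDA setup failed\n");
        return 1;
    }

    // cuMemGetInfo reports free and total memory, in bytes, for the device
    // of the current context.
    std::size_t free_bytes = 0, total_bytes = 0;
    CUresult status = cuMemGetInfo(&free_bytes, &total_bytes);
    if (status != CUDA_SUCCESS) {
        const char* name = nullptr;
        cuGetErrorName(status, &name);
        std::fprintf(stderr, "cuMemGetInfo failed: %s\n",
                     name ? name : "unknown error");
        cuDevicePrimaryCtxRelease(device);
        return 1;
    }
    std::printf("free device memory: %zu MiB of %zu MiB\n",
                free_bytes >> 20, total_bytes >> 20);

    cuDevicePrimaryCtxRelease(device);
    return 0;
}
```

Like any driver API program, this needs a CUDA-capable device and linking against the driver library (typically `-lcuda`).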
