
OpenCL: Distinguishing computation failure from TDR interrupt

When running long OpenCL computations on Windows using the GPU that also runs the main display, the OS may interrupt the computation with Timeout Detection and Recovery (TDR).

In my experience (Java, using JavaCL by NativeLibs4Java, with an NVidia GPU) this manifests as an "Out of Resources" (cl_out_of_resources) error when invoking clEnqueueReadBuffer.

The problem is that I get the exact same message when the OpenCL program fails for other reasons (e.g., because of accessing invalid memory).

Is there a (semi-)reliable way to distinguish between an "Out of Resources" caused by TDR and an "Out of Resources" caused by other problems?

Alternatively, can I at least reliably (in Java / through the OpenCL API) determine that the GPU used for computation is also running the display?

I am aware of this question; however, the answer there is concerned with scenarios where clFinish does not return, which is not a problem for me (so far my code has never frozen inside the OpenCL API).

Is there a (semi-)reliable way to distinguish between an "Out of Resources" caused by TDR and an "Out of Resources" caused by other problems?

1)

If you can access

  KeyPath   : HKEY_LOCAL_MACHINE\System\CurrentControlSet\Control\GraphicsDrivers
  KeyValue  : TdrDelay
  ValueType : REG_DWORD
  ValueData : Number of seconds to delay. 2 seconds is the default value.

from WMI, and multiply it by

  KeyPath   : HKEY_LOCAL_MACHINE\System\CurrentControlSet\Control\GraphicsDrivers
  KeyValue  : TdrLimitCount
  ValueType : REG_DWORD
  ValueData : Number of TDRs before crashing. The default value is 5.

again with WMI. Multiplying the default values gives 10 seconds. And you should also get

  KeyPath   : HKEY_LOCAL_MACHINE\System\CurrentControlSet\Control\GraphicsDrivers
  KeyValue  : TdrLimitTime
  ValueType : REG_DWORD
  ValueData : Number of seconds before crashing. 60 seconds is the default value.

which should read 60 seconds from WMI by default.

For this example computer, it takes 5 x 2-second delays (plus 1 extra) before the final 60-second crash limit is reached. Then you can check in the application whether the last stopwatch counter exceeded those limits. If yes, it is probably TDR. On top of these there is also a thread-exit-from-driver time limit,

  KeyPath   : HKEY_LOCAL_MACHINE\System\CurrentControlSet\Control\GraphicsDrivers
  KeyValue  : TdrDdiDelay
  ValueType : REG_DWORD
  ValueData : Number of seconds to leave the driver. 5 seconds is the default value.

which is 5 seconds by default. Accessing an invalid memory segment should exit more quickly than that. Maybe you can increase these TDR time limits through WMI to several minutes so the program can finish computing without crashing because of preemption starvation. But changing the registry can be dangerous: if you set the TDR time limit to 1 second or some fraction of it, Windows may never boot without constant TDR crashes, so just reading those variables is safer.
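
For illustration, here is a minimal Java sketch of that check. It reads the values with the stock "reg query" command rather than WMI (an assumption; the values are often absent, in which case the documented defaults apply), and compares a simple stopwatch against TdrDelay; the class and variable names are placeholders:

import java.io.BufferedReader;
import java.io.InputStreamReader;

public class TdrLimits {

    // Reads a DWORD under HKLM\...\GraphicsDrivers via "reg query".
    // Returns the documented default when the value is absent (the common case).
    static int readTdrValue(String name, int defaultValue) {
        try {
            Process p = new ProcessBuilder("reg", "query",
                    "HKLM\\System\\CurrentControlSet\\Control\\GraphicsDrivers",
                    "/v", name).redirectErrorStream(true).start();
            try (BufferedReader r = new BufferedReader(new InputStreamReader(p.getInputStream()))) {
                String line;
                while ((line = r.readLine()) != null) {
                    line = line.trim();
                    if (line.startsWith(name)) {
                        String[] parts = line.split("\\s+");
                        return Integer.decode(parts[parts.length - 1]); // value is printed as 0x...
                    }
                }
            }
        } catch (Exception ignored) { }
        return defaultValue;
    }

    public static void main(String[] args) {
        int tdrDelay      = readTdrValue("TdrDelay", 2);       // seconds per single TDR
        int tdrLimitCount = readTdrValue("TdrLimitCount", 5);  // TDRs allowed in the window
        int tdrLimitTime  = readTdrValue("TdrLimitTime", 60);  // window length in seconds
        int tdrDdiDelay   = readTdrValue("TdrDdiDelay", 5);    // driver-exit limit in seconds
        System.out.printf("TdrDelay=%ds TdrLimitCount=%d TdrLimitTime=%ds TdrDdiDelay=%ds%n",
                tdrDelay, tdrLimitCount, tdrLimitTime, tdrDdiDelay);

        long start = System.nanoTime();
        // ... run the OpenCL computation here and catch the CL_OUT_OF_RESOURCES failure ...
        double elapsedSec = (System.nanoTime() - start) / 1e9;

        // Heuristic: a run that lasted at least TdrDelay seconds before failing is a TDR
        // suspect; an invalid-memory abort usually comes back much faster.
        System.out.println(elapsedSec >= tdrDelay ? "probably TDR" : "probably not TDR");
    }
}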

2)

You separate the total work into much smaller parts. If the data is not separable, copy it once, then enqueue the long-running kernel as very-short-ranged kernels n times, with some waiting between any two.

Then you can be sure whether TDR has been eliminated. If this version runs but the long-running kernel doesn't, it is a TDR fault. If it is the opposite, it is a memory crash. It looks like this (a sketch of the chunked enqueueing follows the two diagrams below):

short running x 1024 times
long running
long running <---- fail? TDR! because a memory bug would crash the short version too!
long running

Another try:

short running x 1024 times <---- fail? memory! because only 1ms per kernel
long running
long running 
long running
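
As a rough sketch of the splitting idea, written here against the low-level org.jocl (JOCL) bindings for illustration (the question uses JavaCL, where equivalent enqueue calls exist; the queue and kernel are assumed to be created elsewhere):

import static org.jocl.CL.*;
import org.jocl.*;

public class ChunkedEnqueue {
    // Runs "kernel" over [0, totalWorkItems) in small chunks, finishing each one,
    // so no single submission can run long enough to trigger TDR.
    static void runInChunks(cl_command_queue queue, cl_kernel kernel,
                            long totalWorkItems, long chunkSize) {
        for (long offset = 0; offset < totalWorkItems; offset += chunkSize) {
            long range = Math.min(chunkSize, totalWorkItems - offset);
            clEnqueueNDRangeKernel(queue, kernel, 1,
                    new long[]{offset},   // global work offset of this chunk
                    new long[]{range},    // global work size of this chunk
                    null, 0, null, null);
            clFinish(queue);              // give the display driver a chance to preempt
            // optionally pause briefly between chunks, e.g. Thread.sleep(1)
        }
    }
}

If this chunked version completes but the single long-running launch still fails, TDR is the likely culprit; if even the short chunks fail, suspect the kernel itself.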

Alternatively, can I at least reliably (in Java / through the OpenCL API) determine that the GPU used for computation is also running the display?

1)

Use the interoperability properties of both devices:

// adapted from Intel's OpenCL/OpenGL interop sample; props describes the existing GL context
size_t bytes = 0;
clGetGLContextInfoKHR(props, CL_DEVICES_FOR_GL_CONTEXT_KHR, 0, NULL, &bytes); // query size first
std::vector<cl_device_id> devs(bytes / sizeof(cl_device_id));
// reading the info: the devices that can share the GL context (i.e. can drive the display)
clGetGLContextInfoKHR(props, CL_DEVICES_FOR_GL_CONTEXT_KHR, bytes, devs.data(), NULL);

This gives the list of interoperable devices, i.e. the devices that can share the GL (display) context. You should get its id so you can exclude it if you don't want to use it.

2)

Have another thread run some OpenGL or DirectX code that draws a static object, to keep one of the GPUs busy. Then test all GPUs simultaneously from another thread with some trivial OpenCL kernels. Test:

  • opengl starts drawing something with a high triangle count @60 fps.
  • start the devices for opencl compute, get the average kernel executions per second (keps)
  • device 1: 30 keps
  • device 2: 40 keps
  • after a while, stop opengl and close its window (if not already closed)
  • device 1: 75 keps -----> highest increase in percentage! --> the display GPU!!!
  • device 2: 41 keps ----> not as high an increase

You should not copy any data between devices while doing this, so that the CPU/RAM does not become a bottleneck.
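
A minimal sketch of such a probe, again assuming JOCL-style bindings and a trivial kernel created elsewhere (names are placeholders); call it once per device while the OpenGL window is drawing and once after it is closed, then compare the rates:

import static org.jocl.CL.*;
import org.jocl.*;

public class KernelRateProbe {
    // Measures how many trivial kernel launches per second a queue sustains.
    // The device whose rate improves most when on-screen rendering stops is
    // likely the one that also drives the display.
    static double launchesPerSecond(cl_command_queue queue, cl_kernel trivialKernel,
                                    long globalSize, int launches) {
        long start = System.nanoTime();
        for (int i = 0; i < launches; i++) {
            clEnqueueNDRangeKernel(queue, trivialKernel, 1, null,
                    new long[]{globalSize}, null, 0, null, null);
        }
        clFinish(queue); // wait for all launches before stopping the clock
        double seconds = (System.nanoTime() - start) / 1e9;
        return launches / seconds;
    }
}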

3)

If the data is separable, you can use a divide-and-conquer algorithm that gives any GPU its own piece of work only when it is available, which leaves the display side more flexibility (this is a performance-aware solution and could be similar to the short-running version above, but the scheduling is done across multiple GPUs).

4)

I didn't check this because I sold my second GPU, but you should try

CL_DEVICE_TYPE_DEFAULT

in your multi-GPU system to test whether it picks the display GPU or not. Shut down the PC, plug the monitor cable into the other card, try again. Shut down, swap the cards between slots, try again. Shut down, remove one of the cards so only 1 GPU and 1 CPU are left, try again. If all of these give only the display GPU, then it should be marking the display GPU as the default.
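
For reference, a small JOCL-based sketch (an illustration, not the asker's JavaCL code) that asks the first platform for its CL_DEVICE_TYPE_DEFAULT device and prints its name, so you can compare it with the card the monitor is plugged into; it assumes a single platform, so iterate over all platforms in real code:

import static org.jocl.CL.*;
import org.jocl.*;

public class DefaultDeviceProbe {
    public static void main(String[] args) {
        CL.setExceptionsEnabled(true);

        // first platform only; loop over all platforms in real code
        cl_platform_id[] platforms = new cl_platform_id[1];
        clGetPlatformIDs(1, platforms, null);

        // ask for the device the implementation considers "default"
        cl_device_id[] devices = new cl_device_id[1];
        clGetDeviceIDs(platforms[0], CL_DEVICE_TYPE_DEFAULT, 1, devices, null);

        // read its name to see whether it matches the card driving the display
        byte[] name = new byte[256];
        clGetDeviceInfo(devices[0], CL_DEVICE_NAME, name.length, Pointer.to(name), null);
        System.out.println("Default device: " + new String(name).trim());
    }
}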
