
GPU crashes when running Keras/tensorflow-gpu, specifically when the clock speed drops to idle at 0 MHz

I'm using Jupyter Notebook to run Keras with a TensorFlow GPU backend. I've tested various dummy models while monitoring my GPU usage with MSI Afterburner, GPU-Z, nvidia-smi, and Task Manager. My GPU is a GeForce GTX 960M, which has no issues running games. Temperatures also stay low when running Keras.

What I've noticed is that Keras runs fine at first (e.g. loading or training a model), but whenever Keras is not running anything, the GPU tries to idle down from 1097 MHz to 0 MHz, and as soon as it does, the GPU crashes. nvidia-smi then reports "GPU is lost". I have to disable and re-enable the GPU in Device Manager to get it working again.
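To capture the moment the card leaves P0, the clock speed, performance state, and memory use can be logged from a script instead of watched by eye. The following is a minimal sketch, assuming `nvidia-smi` is on the PATH and that the first GPU listed is the one of interest; the helper names (`parse_sample`, `read_gpu_state`, `watch`) are made up for this example.

```python
import subprocess
import time

# Properties requested from nvidia-smi via --query-gpu.
FIELDS = ["timestamp", "pstate", "clocks.sm", "memory.used", "temperature.gpu"]


def parse_sample(line):
    """Turn one CSV line of nvidia-smi output into a dict keyed by field name."""
    values = [v.strip() for v in line.split(",")]
    if len(values) != len(FIELDS):
        raise ValueError(f"unexpected nvidia-smi output: {line!r}")
    return dict(zip(FIELDS, values))


def read_gpu_state():
    """Query the first GPU once; raises if nvidia-smi exits with an error."""
    out = subprocess.run(
        ["nvidia-smi",
         "--query-gpu=" + ",".join(FIELDS),
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_sample(out.splitlines()[0])


def watch(interval_s=1.0, samples=60):
    """Print one sample per interval and stop at the first failed query."""
    for _ in range(samples):
        try:
            print(read_gpu_state())
        except (subprocess.CalledProcessError, FileNotFoundError) as exc:
            print("nvidia-smi query failed:", exc)
            break
        time.sleep(interval_s)
```

Calling `watch()` in a separate terminal while the notebook sits idle should leave a record of the last performance state and clock speed seen before the query starts failing.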

Does anyone have any idea why this might be happening?

Edit: I can temporarily prevent this for very small programs by enabling the "allow_growth" option as follows:

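import tensorflow as tf
from keras.backend.tensorflow_backend import set_session

# TensorFlow 1.x API: allocate GPU memory on demand instead of reserving it up front.
config = tf.ConfigProto()
config.gpu_options.allow_growth = True
sess = tf.Session(config=config)
set_session(sess)  # make Keras use this session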

However, this only works if the workload is small enough to use only about 0.1 GB of GPU memory, such as loading a model or running a really small one. If the program uses even 0.3 GB, my GPU crashes, because the memory usage does not return to 0 GB before the clock speed drops to 0 MHz (the lower power state).
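For readers on TensorFlow 2.x, where `tf.ConfigProto` and `tf.Session` are no longer in the top-level namespace, the same on-demand allocation can be requested through `tf.config`. This is a sketch of that equivalent setting only; the helper name `enable_memory_growth` is made up here, and nothing in this post suggests the setting avoids the crash for larger workloads.

```python
def enable_memory_growth():
    """Ask TensorFlow 2.x to grow GPU memory on demand; return the GPUs configured."""
    # Imported inside the function so that defining the helper does not
    # itself initialise TensorFlow.
    import tensorflow as tf

    gpus = tf.config.list_physical_devices("GPU")
    for gpu in gpus:
        # Must be called before any op has initialised the GPU, otherwise
        # TensorFlow raises a RuntimeError.
        tf.config.experimental.set_memory_growth(gpu, True)
    return [gpu.name for gpu in gpus]
```

The helper should be called once at the top of the notebook, before any model is built or loaded.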

I was finally able to figure out the issue thanks to someone on another forum. It was a driver issue: the latest drivers provided by Nvidia cause the crash, whereas the old drivers provided by my laptop manufacturer do not.

Since TensorFlow would not run with my old drivers, I could not troubleshoot further with it directly, so I downloaded eDrawings Viewer and opened some random assembly drawings I found online. First I tried the latest Nvidia drivers: while I manipulate the models, my card is in the P0 state, but if I do nothing and let the software idle, the card drops to a lower power state and the GPU crashes. When I repeated the exercise with my ASUS manufacturer-certified drivers (eDrawings Viewer, unlike TF, works with the older drivers), my GPU did NOT crash.

I also discovered that eDrawings Viewer does not crash even with the latest Nvidia drivers if I go into the Nvidia Control Panel and select "Prefer Maximum Performance" under Power Management Mode. The card then stays in the P0 state whenever the software is open, even after idling for minutes. Unfortunately, since python.exe does not have a graphical interface, this option does not work in my case. As a workaround, I can still run TensorFlow without crashing by running eDrawings Viewer (or really any program that uses a graphical interface) in the background, which keeps my card in the P0 state.
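A different approach, not the one described above and not tested on this hardware, would be to keep the GPU from going idle from inside the Python process by submitting a tiny piece of work at a fixed interval. The sketch below only provides the scheduling part; the `KeepAlive` class is hypothetical, and whether a periodic small workload actually holds the card in P0 is an assumption.

```python
import threading


class KeepAlive:
    """Call `work` every `interval_s` seconds on a daemon thread until stopped."""

    def __init__(self, work, interval_s=0.5):
        self._work = work
        self._interval_s = interval_s
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def _run(self):
        # Event.wait returns False on timeout and True once the event is set,
        # so the loop ends as soon as __exit__ signals a stop.
        while not self._stop.wait(self._interval_s):
            self._work()

    def __enter__(self):
        self._thread.start()
        return self

    def __exit__(self, *exc_info):
        self._stop.set()
        self._thread.join()
        return False  # do not suppress exceptions from the with-block
```

With the TF 1.x session from the question, `work` could be something like `lambda: sess.run(tiny_op)`, where `tiny_op` is a hypothetical small GPU operation, and the notebook's idle periods would sit inside a `with KeepAlive(...):` block.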
