GPU sync failed while using tensorflow

I'm trying to run this simple code to test tensorflow:

    from __future__ import print_function

    import tensorflow as tf

    a = tf.constant(2)
    b = tf.constant(3)

    with tf.Session() as sess:
        print("a=2, b=3")
        print("Addition with constants: %i" % sess.run(a+b))

But, weirdly, I'm getting a GPU sync failed error.

Traceback:

runfile('D:/tf_examples-master/untitled3.py', wdir='D:/tf_examples-master')
a=2, b=3
Traceback (most recent call last):

  File "<ipython-input-5-d4753a508b93>", line 1, in <module>
    runfile('D:/tf_examples-master/untitled3.py', wdir='D:/tf_examples-master')

  File "C:\ProgramData\Anaconda3\envs\env3-gpu\lib\site-packages\spyder\utils\site\sitecustomize.py", line 705, in runfile
    execfile(filename, namespace)

  File "C:\ProgramData\Anaconda3\envs\env3-gpu\lib\site-packages\spyder\utils\site\sitecustomize.py", line 102, in execfile
    exec(compile(f.read(), filename, 'exec'), namespace)

  File "D:/tf_examples-master/untitled3.py", line 15, in <module>
    print("Multiplication with constants: %i" % sess.run(a*b))

  File "C:\ProgramData\Anaconda3\envs\env3-gpu\lib\site-packages\tensorflow\python\client\session.py", line 900, in run
    run_metadata_ptr)

  File "C:\ProgramData\Anaconda3\envs\env3-gpu\lib\site-packages\tensorflow\python\client\session.py", line 1135, in _run
    feed_dict_tensor, options, run_metadata)

  File "C:\ProgramData\Anaconda3\envs\env3-gpu\lib\site-packages\tensorflow\python\client\session.py", line 1316, in _do_run
    run_metadata)

  File "C:\ProgramData\Anaconda3\envs\env3-gpu\lib\site-packages\tensorflow\python\client\session.py", line 1335, in _do_call
    raise type(e)(node_def, op, message)

InternalError: GPU sync failed

Any help will be appreciated.

When I got the GPU sync failed error, restarting my notebook/kernel did not help.

I had another notebook/kernel that had not been shut down and was still using my GPU. To fix the issue, all I did was shut down the other notebook and restart my current notebook, and everything worked!
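
To check whether some other process is still holding the GPU, something like the following can help. This is only a sketch: it assumes the NVIDIA driver's nvidia-smi tool is on the PATH, and the helper names (parse_compute_apps, gpu_processes) are made up for illustration.

```python
import csv
import io
import shutil
import subprocess

# nvidia-smi query for processes that currently hold a compute context.
QUERY = [
    "nvidia-smi",
    "--query-compute-apps=pid,process_name,used_memory",
    "--format=csv,noheader",
]


def parse_compute_apps(csv_text):
    """Parse nvidia-smi CSV rows into (pid, process_name, used_memory) tuples."""
    rows = []
    for row in csv.reader(io.StringIO(csv_text), skipinitialspace=True):
        if len(row) == 3:
            pid, name, mem = (field.strip() for field in row)
            rows.append((int(pid), name, mem))
    return rows


def gpu_processes():
    """Return processes using the GPU, or None if nvidia-smi cannot be run."""
    if shutil.which("nvidia-smi") is None:
        return None
    result = subprocess.run(QUERY, capture_output=True, text=True)
    if result.returncode != 0:
        return None
    return parse_compute_apps(result.stdout)


if __name__ == "__main__":
    procs = gpu_processes()
    if procs is None:
        print("nvidia-smi not available")
    else:
        for pid, name, mem in procs:
            print(pid, name, mem)
```

If a stale python or notebook kernel shows up in the list, shutting it down (as described above) frees the GPU for the current session.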

TLDR: If you find that tensorflow is throwing a GPU sync failed error, it may be because the model's inputs are too large (as was my case when first running into this problem) or because cuDNN is not installed properly. Verify that cuDNN is installed correctly, reset your nvidia caches (i.e. sudo rm -rf $HOME/.nv/) if you have not yet done so after initially installing CUDA and cuDNN, and restart your machine.


While running an example from the tensorflow (TF) docs ( https://www.tensorflow.org/tutorials/keras/save_and_restore_models#checkpoint_callback_usage ), I was getting the error

"GPU sync failed Error"

when running a tf.keras model with a large input (vectorized MNIST feature data, length = 28^2). Looking into this problem, I found this post ( https://github.com/tensorflow/tensorflow/issues/5688 ), which talks about the problem being caused specifically by large inputs to a model, and, following the chain of supposed cause and effect, this one ( https://github.com/tensorflow/tensorflow/issues/5688 ). The last line of the second post's question shows this error message snippet:

F tensorflow/stream_executor/cuda/cuda_dnn.cc:2440] failed to enqueue convolution on stream: CUDNN_STATUS_NOT_SUPPORTED

From this, I decided to test whether cuDNN (which TF requires) was actually installed correctly ( https://docs.nvidia.com/deeplearning/sdk/cudnn-install/index.html#installlinux-deb ). I followed the docs to verify the cuDNN install ( https://docs.nvidia.com/deeplearning/sdk/cudnn-install/index.html#verify ),

# Copy the cuDNN sample to a writable path.
$ cp -r /usr/src/cudnn_samples_v7/ $HOME
# Go to the writable path.
$ cd $HOME/cudnn_samples_v7/mnistCUDNN
# Compile the mnistCUDNN sample.
$ make clean && make
# Run the mnistCUDNN sample.
$ ./mnistCUDNN
# If cuDNN is properly installed and running on your Linux system, you will see a message similar to the following:
Test passed!

and found that it was throwing this error:

cudnnGetVersion() : 6021 , CUDNN_VERSION from cudnn.h : 6021 (6.0.21)
Host compiler version : GCC 5.4.0
There are 1 CUDA capable devices on your machine :
device 0 : sms 20  Capabilities 6.1, SmClock 1797.0 Mhz, MemSize (Mb) 8107, MemClock 5005.0 Mhz, Ecc=0, boardGroupID=0
Using device 0

Testing single precision
CUDNN failure
Error: CUDNN_STATUS_INTERNAL_ERROR
mnistCUDNN.cpp:394
Aborting...

Looking into this more, I found NVIDIA devtalk threads here ( https://devtalk.nvidia.com/default/topic/1025900/cudnn/cudnn-fails-with-cudnn_status_internal_error-on-mnist-sample-execution/post/5259556/#5259556 ) and here ( https://devtalk.nvidia.com/default/topic/1024761/cuda-setup-and-installation/cudnn_status_internal_error-when-using-cudnn7-0-with-cuda-8-0/post/5217666/#5217666 ), which recommend clearing the nvidia caches via

sudo rm -rf ~/.nv/

and restarting my machine (otherwise both the CUDA and cuDNN installation verification tests will fail). After doing this, both the CUDA ( https://docs.nvidia.com/cuda/archive/9.0/cuda-installation-guide-linux/index.html#install-samples ) and cuDNN ( https://docs.nvidia.com/deeplearning/sdk/cudnn-install/index.html#installlinux-deb ) installation checks passed.

I was then finally able to run the TF model without error.

model.fit(train_images, train_labels,  
          epochs = 10, 
          validation_data = (test_images, test_labels),
          callbacks = [cp_callback])  # pass callback to training

Train on 1000 samples, validate on 1000 samples
Epoch 1/10
1000/1000 [==============================] - 1s 604us/step - loss: 1.1795 - acc: 0.6720 - val_loss: 0.7519 - val_acc: 0.7580

Epoch 00001: saving model to training_1/cp.ckpt
WARNING:tensorflow:This model was compiled with a Keras optimizer () but is being saved in TensorFlow format with save_weights. The model's weights will be saved, but unlike with TensorFlow optimizers in the TensorFlow format the optimizer's state will not be saved.
.....

Hope this helps you.

Note: this may be an easy problem to run into. The tensorflow docs explicitly require both CUDA and cuDNN to be installed for GPU support in TF, but you can actually pip install tensorflow-gpu without installing cuDNN, even though this is not the correct thing to do. If you are too eager, this can mislead you into blaming something in your code rather than an underlying installation requirement (which was the real culprit in this case).

I had the same error

GPU sync failed

today, when my CNN had been running for about 12 hours.
Restarting the computer solved the problem temporarily.

Edited:

Today I had this error again. Instead of restarting the computer, I restarted the IPython console, and the error disappeared too. It seems that within the same Python environment tensorflow can no longer find an available GPU. If the Python environment is restarted, everything goes back to normal. I'm using tensorflow-gpu v1.10.0 and cudnn v7.1.4 with a GTX 950M.
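
A quick way to see whether tensorflow can still find a GPU in the current environment is to list the local devices. The sketch below assumes a TF 1.x-style install where tensorflow.python.client.device_lib is importable; the helper names are made up for illustration. Note that listing devices initializes the GPU, so it is best run in a throwaway console.

```python
def gpu_device_names(devices):
    """Return the names of all entries whose device_type is 'GPU'."""
    return [d.name for d in devices if d.device_type == "GPU"]


def visible_gpus():
    """Return GPU device names seen by tensorflow, or None if it is not installed."""
    try:
        # Assumption: this module path exists in the installed tensorflow version.
        from tensorflow.python.client import device_lib
    except ImportError:
        return None
    return gpu_device_names(device_lib.list_local_devices())


if __name__ == "__main__":
    gpus = visible_gpus()
    if gpus is None:
        print("tensorflow is not installed")
    elif not gpus:
        print("tensorflow sees no GPU in this environment")
    else:
        print("GPUs:", gpus)
```

An empty list in a console that previously worked would be consistent with the behaviour described above, where only restarting the Python environment brings the GPU back.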

This is an older question, but for those who come across it, my fix was different from the other answers.

The code used import schedule to run a Tensorflow model at scheduled times. It would run the first time without issue, then on the second run it would return a

GPU sync failed

error. Previously, I had fixed a memory issue by using from numba import cuda to release the memory allocated by Tensorflow. The code included the line cuda.close(), as I thought that Tensorflow would reopen a CUDA session at the next run. I removed the cuda.close() line, and everything has been working well ever since.
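
If the goal is still to release GPU memory between scheduled runs, one alternative (not what I did above, just a sketch of a different technique) is to run each job in a fresh Python process, since the driver reclaims a process's GPU memory when it exits. JOB_SOURCE and run_in_fresh_process are hypothetical names; the real job body would build and run the Tensorflow model.

```python
import subprocess
import sys


def run_in_fresh_process(source, timeout=None):
    """Run Python source in a new interpreter and return (returncode, stdout)."""
    result = subprocess.run(
        [sys.executable, "-c", source],
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    return result.returncode, result.stdout


# Hypothetical job body; a real one would import tensorflow and run the model.
JOB_SOURCE = "print('job finished')"

if __name__ == "__main__":
    code, out = run_in_fresh_process(JOB_SOURCE)
    print(code, out.strip())
```

The scheduler process itself then never touches the GPU, so there is no CUDA context in it to close or reopen between runs.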
