GPU sync failed while using TensorFlow
I'm trying to run this simple code to test TensorFlow:
from __future__ import print_function
import tensorflow as tf
a = tf.constant(2)
b = tf.constant(3)
with tf.Session() as sess:
    print("a=2, b=3")
    print("Addition with constants: %i" % sess.run(a+b))
But, weirdly, I'm getting a GPU sync failed error.
Traceback:
runfile('D:/tf_examples-master/untitled3.py', wdir='D:/tf_examples-master')
a=2, b=3
Traceback (most recent call last):
File "<ipython-input-5-d4753a508b93>", line 1, in <module>
runfile('D:/tf_examples-master/untitled3.py', wdir='D:/tf_examples-master')
File "C:\ProgramData\Anaconda3\envs\env3-gpu\lib\site-packages\spyder\utils\site\sitecustomize.py", line 705, in runfile
execfile(filename, namespace)
File "C:\ProgramData\Anaconda3\envs\env3-gpu\lib\site-packages\spyder\utils\site\sitecustomize.py", line 102, in execfile
exec(compile(f.read(), filename, 'exec'), namespace)
File "D:/tf_examples-master/untitled3.py", line 15, in <module>
print("Multiplication with constants: %i" % sess.run(a*b))
File "C:\ProgramData\Anaconda3\envs\env3-gpu\lib\site-packages\tensorflow\python\client\session.py", line 900, in run
run_metadata_ptr)
File "C:\ProgramData\Anaconda3\envs\env3-gpu\lib\site-packages\tensorflow\python\client\session.py", line 1135, in _run
feed_dict_tensor, options, run_metadata)
File "C:\ProgramData\Anaconda3\envs\env3-gpu\lib\site-packages\tensorflow\python\client\session.py", line 1316, in _do_run
run_metadata)
File "C:\ProgramData\Anaconda3\envs\env3-gpu\lib\site-packages\tensorflow\python\client\session.py", line 1335, in _do_call
raise type(e)(node_def, op, message)
InternalError: GPU sync failed
Any help will be appreciated.
When I got the GPU sync failed error, restarting my notebook/kernel did not help.
I had another notebook/kernel that had not been shut down and was still using my GPU. To fix the issue, all I did was shut down the other notebook and restart my current notebook, and everything worked!
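To see whether another process is still holding the GPU, as described above, you can ask nvidia-smi for the list of compute processes. This is a minimal sketch, assuming nvidia-smi is on your PATH; the helper names parse_compute_apps and gpu_processes are hypothetical and not part of TensorFlow or any NVIDIA tool.

```python
import csv
import io
import subprocess


def parse_compute_apps(csv_text):
    """Parse the CSV text printed by nvidia-smi's compute-apps query."""
    rows = []
    for fields in csv.reader(io.StringIO(csv_text), skipinitialspace=True):
        if len(fields) != 3:
            continue  # skip blank or unexpected lines
        pid, name, used_memory = (field.strip() for field in fields)
        rows.append({"pid": int(pid), "name": name, "used_memory": used_memory})
    return rows


def gpu_processes():
    """Return the processes that currently hold a compute context on a GPU."""
    result = subprocess.run(
        ["nvidia-smi",
         "--query-compute-apps=pid,process_name,used_memory",
         "--format=csv,noheader"],
        stdout=subprocess.PIPE, universal_newlines=True, check=True)
    return parse_compute_apps(result.stdout)
```

Calling gpu_processes() before starting a new session lists the PIDs that still occupy the GPU. If a forgotten notebook kernel shows up there, shutting it down is the fix described in this answer.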
TLDR: If you find that TensorFlow is throwing a GPU sync failed error, it may be because the model's inputs are too large (as was my case when first running into this problem) or because cuDNN is not installed properly. Verify that cuDNN is installed correctly, reset your NVIDIA caches (i.e. sudo rm -rf $HOME/.nv/) if you have not yet done so after initially installing CUDA and cuDNN, and restart your machine.
While running an example from the TensorFlow (TF) docs ( https://www.tensorflow.org/tutorials/keras/save_and_restore_models#checkpoint_callback_usage ), I was getting the error
"GPU sync failed Error"
when running a tf.keras model with a large input (vectorized MNIST feature data, length = 28^2). Looking into this problem, I found this post ( https://github.com/tensorflow/tensorflow/issues/5688 ), which talks about the problem being caused specifically by large inputs to a model, and, following the chain of supposed effect, this one ( https://github.com/tensorflow/tensorflow/issues/5688 ). The last line of the second post's question shows this error message snippet:
F tensorflow/stream_executor/cuda/cuda_dnn.cc:2440] failed to enqueue convolution on stream: CUDNN_STATUS_NOT_SUPPORTED
From this, I decided to test whether cuDNN (which TF requires) was actually installed correctly ( https://docs.nvidia.com/deeplearning/sdk/cudnn-install/index.html#installlinux-deb ). I followed the docs to verify the cuDNN install ( https://docs.nvidia.com/deeplearning/sdk/cudnn-install/index.html#verify ):
# Copy the cuDNN sample to a writable path.
$ cp -r /usr/src/cudnn_samples_v7/ $HOME
# Go to the writable path.
$ cd $HOME/cudnn_samples_v7/mnistCUDNN
# Compile the mnistCUDNN sample.
$ make clean && make
# Run the mnistCUDNN sample.
$ ./mnistCUDNN
# If cuDNN is properly installed and running on your Linux system, you will see a message similar to the following:
Test passed!
Instead, I found that the sample was throwing this error:
cudnnGetVersion() : 6021 , CUDNN_VERSION from cudnn.h : 6021 (6.0.21)
Host compiler version : GCC 5.4.0
There are 1 CUDA capable devices on your machine :
device 0 : sms 20 Capabilities 6.1, SmClock 1797.0 Mhz, MemSize (Mb) 8107, MemClock 5005.0 Mhz, Ecc=0, boardGroupID=0
Using device 0
Testing single precision
CUDNN failure
Error: CUDNN_STATUS_INTERNAL_ERROR
mnistCUDNN.cpp:394
Aborting...
Looking into this more, I found NVIDIA developer forum threads here ( https://devtalk.nvidia.com/default/topic/1025900/cudnn/cudnn-fails-with-cudnn_status_internal_error-on-mnist-sample-execution/post/5259556/#5259556 ) and here ( https://devtalk.nvidia.com/default/topic/1024761/cuda-setup-and-installation/cudnn_status_internal_error-when-using-cudnn7-0-with-cuda-8-0/post/5217666/#5217666 ), which recommend clearing the NVIDIA caches via
sudo rm -rf ~/.nv/
and restarting my machine (otherwise the installation verification tests for both CUDA and cuDNN will fail). After doing this, both the CUDA ( https://docs.nvidia.com/cuda/archive/9.0/cuda-installation-guide-linux/index.html#install-samples ) and cuDNN ( https://docs.nvidia.com/deeplearning/sdk/cudnn-install/index.html#installlinux-deb ) installation checks passed.
I was finally able to run the TF model without error.
model.fit(train_images, train_labels,
          epochs = 10,
          validation_data = (test_images, test_labels),
          callbacks = [cp_callback])  # pass callback to training
Train on 1000 samples, validate on 1000 samples
Epoch 1/10
1000/1000 [==============================] - 1s 604us/step - loss: 1.1795 - acc: 0.6720 - val_loss: 0.7519 - val_acc: 0.7580
Epoch 00001: saving model to training_1/cp.ckpt
WARNING:tensorflow:This model was compiled with a Keras optimizer () but is being saved in TensorFlow format with save_weights. The model's weights will be saved, but unlike with TensorFlow optimizers in the TensorFlow format the optimizer's state will not be saved.
.....
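The cache-clearing step used in this answer can also be scripted. This is a minimal sketch, assuming the cache lives in ~/.nv as in the forum threads linked above; clear_nv_cache is a hypothetical helper, not an NVIDIA or TensorFlow tool. It defaults to a dry run so nothing is deleted by accident, and it will need elevated privileges (like the sudo in the original command) if the directory is owned by root.

```python
import shutil
from pathlib import Path


def clear_nv_cache(cache_dir=None, dry_run=True):
    """Remove the NVIDIA per-user cache directory (assumed default: ~/.nv).

    Returns True if the directory existed, False otherwise.
    With dry_run=True the directory is only reported, not deleted.
    """
    if cache_dir is None:
        cache_dir = Path.home() / ".nv"
    cache_dir = Path(cache_dir)
    if not cache_dir.is_dir():
        return False
    if dry_run:
        print("would remove", cache_dir)
    else:
        shutil.rmtree(str(cache_dir))
        print("removed", cache_dir)
    return True
```

Run clear_nv_cache() first to confirm the path, then clear_nv_cache(dry_run=False) to delete it. As in the answer, restart the machine afterwards before re-running the CUDA and cuDNN verification samples.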
Hope this helps you.
Note: this may be an easy problem to run into. The TensorFlow docs explicitly require that both CUDA and cuDNN be installed for GPU support in TF, but you can actually pip install tensorflow-gpu without installing cuDNN, even though this is not the correct thing to do. If you are too eager, this could mislead you into blaming something in your code rather than an underlying installation requirement (which would actually be the right thing to check in this case).
I had the same GPU sync failed error today after my CNN had been running for about 12 hours.
Restarting the computer solved this problem temporarily.
Edited:
Today I had this error again. Instead of restarting the computer, I restarted the IPython console, and the error disappeared too. It seems that, within the same Python environment, TensorFlow can no longer find an available GPU. If the Python environment is restarted, everything goes back to normal. I'm using tensorflow-gpu v1.10.0 and cuDNN v7.1.4 with a GTX 950M.
This is an older question, but for those who come across it, my fix was different from the other answers.
The code used import schedule to run a TensorFlow model at scheduled times. It would run the first time without issue, but on the second run it would return a GPU sync failed error. Previously, I had fixed a memory issue by using from numba import cuda to release the memory allocated by TensorFlow. That code included the line cuda.close(), as I thought TensorFlow would reopen a CUDA session at the next run. I removed the cuda.close() line, and everything has been working well ever since.
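If GPU memory still has to be released between scheduled runs, one alternative to cuda.close() is to run each job in its own short-lived Python process: the GPU memory is freed when that process exits, and the next run starts with a fresh CUDA context. This is a minimal sketch, assuming the job can be written as a standalone script; run_in_fresh_process and the JOB string are hypothetical stand-ins, not part of TensorFlow, numba, or schedule.

```python
import subprocess
import sys


def run_in_fresh_process(script_source, timeout=None):
    """Run Python source in a new interpreter and return what it printed."""
    result = subprocess.run(
        [sys.executable, "-c", script_source],
        stdout=subprocess.PIPE, stderr=subprocess.PIPE,
        universal_newlines=True, timeout=timeout)
    if result.returncode != 0:
        raise RuntimeError("child process failed:\n" + result.stderr)
    return result.stdout


# Hypothetical stand-in for the real job; in practice this would be the
# code that imports TensorFlow, builds the model, and runs it.
JOB = "print(2 + 3)"
```

A scheduler callback would then call run_in_fresh_process(JOB) instead of running the model in the long-lived scheduler process. The trade-off is that TensorFlow is re-imported and re-initialized on every run, which adds startup time.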