Inception v3 retraining error (flower example)

I'm currently facing a weird bug with the flower retraining example ( https://www.tensorflow.org/versions/r0.9/how_tos/image_retraining/index.html ).

TensorFlow release 0.9 was installed from source, and I tried to run the image_retraining Python script. It starts and creates a few bottlenecks, but then the following error message appears.

Does anyone have an idea what the problem could be? I didn't find any similar posts.

E tensorflow/core/kernels/check_numerics_op.cc:157] abnormal_detected_host @0x10007200300 = {1, 0} activation input is not finite.
Traceback (most recent call last):
  File "examples/image_retraining/retrain.py", line 888, in <module>
    tf.app.run()
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 30, in run
    sys.exit(main(sys.argv))
  File "examples/image_retraining/retrain.py", line 798, in main
    jpeg_data_tensor, bottleneck_tensor)
  File "examples/image_retraining/retrain.py", line 456, in cache_bottlenecks
    jpeg_data_tensor, bottleneck_tensor)
  File "examples/image_retraining/retrain.py", line 414, in get_or_create_bottleneck
    bottleneck_tensor)
  File "examples/image_retraining/retrain.py", line 331, in run_bottleneck_on_image
    {image_data_tensor: image_data})
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 382, in run
    run_metadata_ptr)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 655, in _run
    feed_dict_string, options, run_metadata)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 723, in _do_run
    target_list, options, run_metadata)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 743, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors.InvalidArgumentError: activation input is not finite. : Tensor had NaN values
         [[Node: conv_1/CheckNumerics = CheckNumerics[T=DT_FLOAT, message="activation input is not finite.", _device="/job:localhost/replica:0/task:0/gpu:0"](conv_1/batchnorm)]]
Caused by op u'conv_1/CheckNumerics', defined at:
  File "examples/image_retraining/retrain.py", line 888, in <module>
    tf.app.run()
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 30, in run
    sys.exit(main(sys.argv))
  File "examples/image_retraining/retrain.py", line 769, in main
    create_inception_graph())
  File "examples/image_retraining/retrain.py", line 312, in create_inception_graph
    RESIZED_INPUT_TENSOR_NAME]))
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/importer.py", line 274, in import_graph_def
    op_def=op_def)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 2297, in create_op
    original_op=self._default_original_op, op_def=op_def)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 1231, in __init__
    self._traceback = _extract_stack()

Update: Just to follow up, TensorFlow 1.6 is recommended, as many operations are much faster. If you are running an Nvidia GPU, make sure you install CUDA 9.0 and not 9.1; 9.1 will break everything.

cuDNN needs to match both CUDA 9.0 and the version that TensorFlow was built with. For TensorFlow 1.6, be sure to install version 7.0.4, not 7.1; the exact build is cuDNN v7.0.4.31-1 for CUDA 9.0 (not 9.1). Newer releases (7.1.2 at the time of writing) will throw errors because TensorFlow 1.6 was built with 7.0.4.
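
The version constraints above can be written down as a small check. This is only a sketch: the function name `mismatches` and the idea of passing the installed versions in as strings are illustrative assumptions, and the expected values are simply the ones this answer reports for TensorFlow 1.6. It does not detect what is installed on your machine; you supply those strings yourself.

```python
# Versions this answer reports TensorFlow 1.6 was built against.
EXPECTED = {"cuda": "9.0", "cudnn": "7.0.4"}


def mismatches(installed, expected=EXPECTED):
    """Return {component: (installed, expected)} for each component that does not match.

    A version matches when its dotted components start with the expected
    ones, so cuDNN "7.0.4.31-1" matches "7.0.4" while "7.1.2" does not.
    """
    problems = {}
    for name, want in expected.items():
        have = installed.get(name)
        want_parts = want.split(".")
        have_parts = have.split(".") if have else []
        if have_parts[:len(want_parts)] != want_parts:
            problems[name] = (have, want)
    return problems


if __name__ == "__main__":
    # Hypothetical installed versions, typed in by hand.
    print(mismatches({"cuda": "9.0", "cudnn": "7.0.4.31-1"}))
    print(mismatches({"cuda": "9.1", "cudnn": "7.1.2"}))
```

An empty result means both components line up with the versions named above; any entry in the result names a component to downgrade or reinstall.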

Original post: This is a bug in TensorFlow that I have also encountered (I'm using 2x GTX 1080 on Ubuntu 14.04).

One option is to install CUDA 8.0. However, CUDA 8.0 isn't fully supported, and you may encounter other issues.

If you are just experimenting, another workaround is to build and run it on CPUs only, at least for the bottleneck generation phase.

bazel build -c opt --copt=-mavx tensorflow/examples/image_retraining:retrain
bazel-bin/tensorflow/examples/image_retraining/retrain --image_dir ~/flower_photos

As you probably know, if you've built TensorFlow with GPU support and then run this:

python tensorflow/examples/image_retraining/retrain.py --image_dir ~/flower_photos

it will run with GPU support, and you'll probably hit the same error.
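
If rebuilding without GPU support is inconvenient, hiding the GPUs from the process may give a similar CPU-only run. The sketch below is an assumption, not part of the workaround described in this answer: it relies on the CUDA runtime honouring the `CUDA_VISIBLE_DEVICES` environment variable, and the helper names `cpu_only_env` and `retrain_command` are hypothetical.

```python
import os
import sys


def cpu_only_env(base_env=None):
    """Return a copy of the environment with all CUDA devices hidden."""
    env = dict(os.environ if base_env is None else base_env)
    # An empty value leaves the CUDA runtime with no visible GPUs.
    env["CUDA_VISIBLE_DEVICES"] = ""
    return env


def retrain_command(image_dir):
    """Build the retrain.py command line shown in this answer."""
    return [
        sys.executable,
        "tensorflow/examples/image_retraining/retrain.py",
        "--image_dir",
        os.path.expanduser(image_dir),
    ]
```

From the TensorFlow source root, something like `subprocess.call(retrain_command("~/flower_photos"), env=cpu_only_env())` would then launch the script with the GPUs hidden. Whether this avoids the NaN error on a given setup is untested here.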

I've opened an issue here: https://github.com/tensorflow/tensorflow/issues/3560

Until they fix it, the workaround works as long as you don't have a large number of categories to classify.
