Why does my TensorFlow training hang indefinitely without an error?

Question

I set up TensorFlow, verified that GPU acceleration works, and set up the configs, all by following this tutorial: https://github.com/nicknochnack/TFODCourse

I run:

py Tensorflow\models\research\object_detection\model_main_tf2.py --model_dir=Tensorflow\workspace\models\my_ssd_mobnet --pipeline_config_path=Tensorflow\workspace\models\my_ssd_mobnet\pipeline.config --num_train_steps=100

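and get the output logs below. I have waited for over an hour; Python keeps using 25-26% of my CPU but never prints any progress logs. Even after lowering the number of steps to 100, I get nothing.

There are a bunch of warnings, but maybe that is normal? I googled some of the INFO logs and found that they are harmless. Is there anything in these logs that shows I missed a step or did something wrong? Here is the abridged log, with the future-deprecation warnings removed: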

    2021-07-11 02:25:42.869766: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library cudart64_110.dll
    py Tensorflow\models\research\object_detection\model_main_tf2.py --model_dir=Tensorflow\workspace\models\my_ssd_mobnet --pipeline_config_path=Tensorflow\workspace\models\my_ssd_mobnet\pipeline.config --num_train_steps=100
    2021-07-11 02:25:44.989884: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library cudart64_110.dll
    2021-07-11 02:25:47.588384: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library nvcuda.dll
    2021-07-11 02:25:47.605286: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1733] Found device 0 with properties:
    pciBusID: 0000:01:00.0 name: NVIDIA GeForce RTX 2080 SUPER computeCapability: 7.5
    coreClock: 1.845GHz coreCount: 48 deviceMemorySize: 8.00GiB deviceMemoryBandwidth: 462.00GiB/s
    2021-07-11 02:25:47.605366: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library cudart64_110.dll
    2021-07-11 02:25:47.610303: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library cublas64_11.dll
    2021-07-11 02:25:47.610390: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library cublasLt64_11.dll
    2021-07-11 02:25:47.613585: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library cufft64_10.dll
    2021-07-11 02:25:47.614873: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library curand64_10.dll
    2021-07-11 02:25:47.621607: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library cusolver64_11.dll
    2021-07-11 02:25:47.623967: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library cusparse64_11.dll
    2021-07-11 02:25:47.624496: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library cudnn64_8.dll
    2021-07-11 02:25:47.626311: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1871] Adding visible gpu devices: 0
    2021-07-11 02:25:47.626728: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX AVX2
    To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
    2021-07-11 02:25:47.627707: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1733] Found device 0 with properties:
    pciBusID: 0000:01:00.0 name: NVIDIA GeForce RTX 2080 SUPER computeCapability: 7.5
    coreClock: 1.845GHz coreCount: 48 deviceMemorySize: 8.00GiB deviceMemoryBandwidth: 462.00GiB/s
    2021-07-11 02:25:47.627810: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1871] Adding visible gpu devices: 0
    2021-07-11 02:25:48.067610: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1258] Device interconnect StreamExecutor with strength 1 edge matrix:
    2021-07-11 02:25:48.067778: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1264]      0
    2021-07-11 02:25:48.068662: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1277] 0:   N
    2021-07-11 02:25:48.069323: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1418] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 5957 MB memory) -> physical GPU (device: 0, name: NVIDIA GeForce RTX 2080 SUPER, pci bus id: 0000:01:00.0, compute capability: 7.5)
    WARNING:tensorflow:Collective ops is not configured at program startup. Some performance features may not be enabled.
    W0711 02:25:48.071784 10384 mirrored_strategy.py:379] Collective ops is not configured at program startup. Some performance features may not be enabled.
    INFO:tensorflow:Using MirroredStrategy with devices ('/job:localhost/replica:0/task:0/device:GPU:0',)
    I0711 02:25:48.225363 10384 mirrored_strategy.py:369] Using MirroredStrategy with devices ('/job:localhost/replica:0/task:0/device:GPU:0',)
    INFO:tensorflow:Maybe overwriting train_steps: 100
    I0711 02:25:48.229352 10384 config_util.py:552] Maybe overwriting train_steps: 100
    INFO:tensorflow:Maybe overwriting use_bfloat16: False
    I0711 02:25:48.230349 10384 config_util.py:552] Maybe overwriting use_bfloat16: False
    INFO:tensorflow:Reading unweighted datasets: ['Tensorflow\\workspace\\annotations\\train.record']
    I0711 02:25:48.308165 10384 dataset_builder.py:163] Reading unweighted datasets: ['Tensorflow\\workspace\\annotations\\train.record']
    INFO:tensorflow:Reading record datasets for input file: ['Tensorflow\\workspace\\annotations\\train.record']
    I0711 02:25:48.309138 10384 dataset_builder.py:80] Reading record datasets for input file: ['Tensorflow\\workspace\\annotations\\train.record']
    INFO:tensorflow:Number of filenames to read: 1
    I0711 02:25:48.311132 10384 dataset_builder.py:81] Number of filenames to read: 1
    WARNING:tensorflow:num_readers has been reduced to 1 to match input file shards.
    W0711 02:25:48.311132 10384 dataset_builder.py:87] num_readers has been reduced to 1 to match input file shards.
    tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:176] None of the MLIR Optimization Passes are enabled (registered 2)

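The full log, including the future-deprecation warnings, is in this gist, but again, the only difference is the deprecation warnings, and none of them should break anything.

I'm just not sure how to debug this. It looks like it is working, and then it just hangs.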

Answer 1

I found the solution in a GitHub issue where other people had the same problem.

https://github.com/tensorflow/models/issues/9581

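The problem was that my TFRecord generation script could not find any images and created empty record files. Unfortunately, both the generation script and TensorFlow fail silently in this case, so training just waits on input that never arrives and no error is raised.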
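A quick way to rule this out before training is to count the records in the file. The snippet below is only a rough sketch, not part of the tutorial: the helper name `count_tfrecord_records` is made up for illustration, it assumes the record file is uncompressed, and it only walks the TFRecord framing (length, length CRC, payload, payload CRC) without verifying the CRCs. The path is the `train.record` path from the log above.

```python
import os
import struct


def count_tfrecord_records(path):
    """Count records in an uncompressed TFRecord file by walking its framing.

    Each record is stored as: uint64 length, uint32 CRC of the length,
    `length` bytes of payload, uint32 CRC of the payload.
    The CRCs are skipped here, not verified.
    """
    count = 0
    with open(path, "rb") as f:
        while True:
            header = f.read(12)  # 8-byte length + 4-byte length CRC
            if not header:
                break  # clean end of file
            if len(header) < 12:
                raise ValueError(f"truncated record header in {path}")
            (length,) = struct.unpack("<Q", header[:8])
            body = f.read(length + 4)  # payload + 4-byte payload CRC
            if len(body) < length + 4:
                raise ValueError(f"truncated record body in {path}")
            count += 1
    return count


if __name__ == "__main__":
    # Path taken from the training log; adjust it to your workspace.
    record_path = os.path.join("Tensorflow", "workspace", "annotations", "train.record")
    if not os.path.exists(record_path):
        print(f"{record_path} does not exist")
    else:
        n = count_tfrecord_records(record_path)
        size = os.path.getsize(record_path)
        print(f"{record_path}: {size} bytes, {n} records")
        if n == 0:
            print("Empty record file: check the image and annotation paths "
                  "passed to the generation script.")
```

If this reports 0 records (or a file size of 0 bytes), regenerate the record files and check that the generation script is pointed at the folder that actually contains the images and their annotations.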
