TensorFlow custom object detector: model_main_tf2 does not start training

Question

Problem summary: The TensorFlow custom object detector never starts fine-tuning when I follow the guide in the docs. It doesn't throw an exception either.

What I've done: I have installed the Object Detection API and run a successful test according to the docs.

I then followed the guide on training a custom object detector here, including modifying the pipeline.config file. As per the guide, I run

model_main_tf2.py  --model_dir=<path1> --pipeline_config_path=<path2> --alsologtostderr

where path1 and path2 are paths like

 D:/COCO/models/workspace/duck-demo/pre-trained-models/efficientdet_d1_coco17_tpu-32/pipeline.config

The output is shown below. According to the guide, this output, including its many warnings, is expected. However, training was supposed to start afterwards. Instead, the script just returns, with neither an error nor any training. What seems to be the problem here?

Output:

...
WARNING:tensorflow:Unresolved object in checkpoint: (root).model._feature_extractor._bifpn_stage.node_input_blocks.7.0.1.1.gamma
W0326 09:24:46.180965 16300 util.py:160] Unresolved object in checkpoint: (root).model._feature_extractor._bifpn_stage.node_input_blocks.7.0.1.1.gamma
WARNING:tensorflow:Unresolved object in checkpoint: (root).model._feature_extractor._bifpn_stage.node_input_blocks.7.0.1.1.beta
W0326 09:24:46.180965 16300 util.py:160] Unresolved object in checkpoint: (root).model._feature_extractor._bifpn_stage.node_input_blocks.7.0.1.1.beta
...
WARNING:tensorflow:A checkpoint was restored (e.g. tf.train.Checkpoint.restore or tf.keras.Model.load_weights) but not all checkpointed values were used. See above for specific issues. 
Use expect_partial() on the load status object, e.g. tf.train.Checkpoint.restore(...).expect_partial(), to silence these warnings, or use assert_consumed() to make the check explicit. See https://www.tensorflow.org/guide/checkpoint#loading_mechanics for details.
W0326 09:24:46.181965 16300 util.py:168] A checkpoint was restored (e.g. tf.train.Checkpoint.restore or tf.keras.Model.load_weights) but not all checkpointed values were used. See above for specific issues. 
Use expect_partial() on the load status object, e.g. tf.train.Checkpoint.restore(...).expect_partial(), to silence these warnings, or use assert_consumed() to make the check explicit. See https://www.tensorflow.org/guide/checkpoint#loading_mechanics for details.

Answer 1

There is a GitHub issue here where many possible solutions to your problem are discussed for different types of TensorFlow 2 models. There's a good chance one of them will help.

As a rule of thumb, always test your installation by running the command python object_detection/builders/model_builder_tf2_test.py before training anything, so that possible issues are diagnosed early.

Answer 2

I had a similar issue and finally resolved it after many agonizing hours, so I'm posting my fix here in case it helps someone else.

After spending too much time thinking the issue was with my model variables, I went back over every step of the TF Object Detection API installation to make sure nothing was wrong. I then found that my Windows 10 environment variables pointed at CUDA 11.2, but I had 11.3 installed. I changed the path and it worked perfectly. I would recommend that anyone who isn't getting an error message check that their environment is installed correctly.
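A mismatch like this can be spotted by listing which CUDA versions the PATH variable refers to. The following is only a minimal sketch: it assumes the default Windows install layout, where toolkit directories look like ...\CUDA\v11.2\bin, and the helper names cuda_versions_on_path and find_mismatch are made up for illustration.

```python
import os
import re

# Assumption: toolkit directories follow the default layout "...\CUDA\v<major>.<minor>\...".
_CUDA_DIR = re.compile(r"[\\/]CUDA[\\/]v(\d+\.\d+)", re.IGNORECASE)


def cuda_versions_on_path(path_value, sep=";"):
    """Return the CUDA versions referenced in a PATH-style string, in order, without duplicates."""
    found = []
    for entry in path_value.split(sep):
        match = _CUDA_DIR.search(entry)
        if match and match.group(1) not in found:
            found.append(match.group(1))
    return found


def find_mismatch(path_value, installed_version, sep=";"):
    """Return the versions referenced on PATH that differ from the installed version."""
    return [v for v in cuda_versions_on_path(path_value, sep) if v != installed_version]


if __name__ == "__main__":
    # os.pathsep is ";" on Windows.
    print(cuda_versions_on_path(os.environ.get("PATH", ""), os.pathsep))
```

If find_mismatch returns a non-empty list for the version you actually installed, the PATH entries are worth correcting before retrying the training command.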

Answer 3

Just wait. It can take a while, and this is something the developers warn about:

The output will normally look like it has “frozen”, but DO NOT rush to cancel the process. The training outputs logs only every 100 steps by default, therefore if you wait for a while, you should see a log for the loss at step 100.

The time you should wait can vary greatly, depending on whether you are using a GPU and the chosen value for batch_size in the config file, so be patient.

If it's not crashing, it is probably working. There's a logging parameter somewhere in model_main_tf2.py that you can change. You can decrease it from 100 to something like 5 or 10 if you want to see log output more frequently.
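One way to tell a slow start from a run that has really stopped is to watch for files appearing in the directory passed as --model_dir. The sketch below rests on assumptions rather than anything stated in the guide: that checkpoints are written as ckpt-N.index and that TensorBoard event files are named events.out.tfevents.*. The function name training_artifacts is hypothetical.

```python
from pathlib import Path


def training_artifacts(model_dir):
    """List checkpoint index files and event files found under model_dir.

    Assumption: checkpoints are named "ckpt-N.index" and event files start with
    "events.out.tfevents."; adjust the patterns if your setup differs.
    """
    root = Path(model_dir)
    checkpoints = sorted(p.name for p in root.glob("ckpt-*.index"))
    events = sorted(str(p.relative_to(root)) for p in root.rglob("events.out.tfevents.*"))
    return {"checkpoints": checkpoints, "events": events}


if __name__ == "__main__":
    import sys

    # Usage: python check_progress.py <model_dir>
    print(training_artifacts(sys.argv[1] if len(sys.argv) > 1 else "."))
```

Empty lists right after launch do not prove anything on their own, since the first files may take a while to appear. Lists that stay empty after the process has exited suggest that training never began.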

Answer 1
1 accepted 2021-03-27 12:37:18

Answer 2
1 2022-04-09 22:12:40

Answer 3
0 2021-03-26 12:41:51