
Unstable loss SSD-300 due to which mAP is low

I am training my SSD-300 model and have resized my images to 300x300. I am using the default settings from this GitHub repo: https://github.com/balancap/SSD-Tensorflow

The loss is unstable during training. I have trained the model for 50,000 training steps, and the current mAP I get is 0.26 (VOC 2007) and 0.24 (VOC 2012).

Training set: 1500 images. Test set: 300 images.

Current parameters:

!python train_ssd_network.py --dataset_name=pascalvoc_2007 --dataset_split_name=train --model_name=ssd_300_vgg --save_summaries_secs=60 --save_interval_secs=600 --weight_decay=0.00004 --optimizer=adam --learning_rate=0.01 --batch_size=2 --gpu_memory_fraction=0.9 --learning_rate_decay_factor=0.94 --num_classes=3 --checkpoint_exclude_scopes=ssd_300_vgg/conv6,ssd_300_vgg/conv7,ssd_300_vgg/block8,ssd_300_vgg/block9,ssd_300_vgg/block10,ssd_300_vgg/block11,ssd_300_vgg/block4_box,ssd_300_vgg/block7_box,ssd_300_vgg/block8_box,ssd_300_vgg/block9_box,ssd_300_vgg/block10_box,ssd_300_vgg/block11_box --eval_training_data=True

What can I do to get a good accuracy (mAP)?

Example of the loss; it even reaches values as high as 80:

W1024 13:57:41.660651 140239494461312 deprecation.py:323] From train_ssd_network.py:256: batch (from tensorflow.python.training.input) is deprecated and will be removed in a future version.
Instructions for updating:
Queue-based input pipelines have been replaced by `tf.data`. Use `tf.data.Dataset.batch(batch_size)` (or `padded_batch(...)` if `dynamic_pad=True`).
WARNING:tensorflow:From train_ssd_network.py:292: The name tf.get_collection is deprecated. Please use tf.compat.v1.get_collection instead.

W1024 13:57:41.676577 140239494461312 module_wrapper.py:139] From train_ssd_network.py:292: The name tf.get_collection is deprecated. Please use tf.compat.v1.get_collection instead.

WARNING:tensorflow:From train_ssd_network.py:292: The name tf.GraphKeys is deprecated. Please use tf.compat.v1.GraphKeys instead.

W1024 13:57:41.676797 140239494461312 module_wrapper.py:139] From train_ssd_network.py:292: The name tf.GraphKeys is deprecated. Please use tf.compat.v1.GraphKeys instead.

WARNING:tensorflow:From /content/gdrive/MyDrive/Training_SSD/SSD-1/deployment/model_deploy.py:194: The name tf.variable_scope is deprecated. Please use tf.compat.v1.variable_scope instead.

W1024 13:57:41.677163 140239494461312 module_wrapper.py:139] From /content/gdrive/MyDrive/Training_SSD/SSD-1/deployment/model_deploy.py:194: The name tf.variable_scope is deprecated. Please use tf.compat.v1.variable_scope instead.

WARNING:tensorflow:From /content/gdrive/MyDrive/Training_SSD/SSD-1/deployment/model_deploy.py:194: The name tf.get_variable_scope is deprecated. Please use tf.compat.v1.get_variable_scope instead.

W1024 13:57:41.677324 140239494461312 module_wrapper.py:139] From /content/gdrive/MyDrive/Training_SSD/SSD-1/deployment/model_deploy.py:194: The name tf.get_variable_scope is deprecated. Please use tf.compat.v1.get_variable_scope instead.

WARNING:tensorflow:From /usr/local/lib/python3.7/dist-packages/tensorflow_core/contrib/layers/python/layers/layers.py:1057: Layer.apply (from tensorflow.python.keras.engine.base_layer) is deprecated and will be removed in a future version.
Instructions for updating:
Please use `layer.__call__` method instead.
W1024 13:57:41.679852 140239494461312 deprecation.py:323] From /usr/local/lib/python3.7/dist-packages/tensorflow_core/contrib/layers/python/layers/layers.py:1057: Layer.apply (from tensorflow.python.keras.engine.base_layer) is deprecated and will be removed in a future version.
Instructions for updating:
Please use `layer.__call__` method instead.
WARNING:tensorflow:From /content/gdrive/MyDrive/Training_SSD/SSD-1/nets/ssd_vgg_300.py:476: dropout (from tensorflow.python.layers.core) is deprecated and will be removed in a future version.
Instructions for updating:
Use keras.layers.dropout instead.
W1024 13:57:41.998192 140239494461312 deprecation.py:323] From /content/gdrive/MyDrive/Training_SSD/SSD-1/nets/ssd_vgg_300.py:476: dropout (from tensorflow.python.layers.core) is deprecated and will be removed in a future version.
Instructions for updating:
Use keras.layers.dropout instead.
WARNING:tensorflow:From /content/gdrive/MyDrive/Training_SSD/SSD-1/nets/ssd_vgg_300.py:642: The name tf.losses.add_loss is deprecated. Please use tf.compat.v1.losses.add_loss instead.

W1024 13:57:42.408573 140239494461312 module_wrapper.py:139] From /content/gdrive/MyDrive/Training_SSD/SSD-1/nets/ssd_vgg_300.py:642: The name tf.losses.add_loss is deprecated. Please use tf.compat.v1.losses.add_loss instead.

WARNING:tensorflow:From train_ssd_network.py:307: The name tf.summary.histogram is deprecated. Please use tf.compat.v1.summary.histogram instead.

W1024 13:57:42.419716 140239494461312 module_wrapper.py:139] From train_ssd_network.py:307: The name tf.summary.histogram is deprecated. Please use tf.compat.v1.summary.histogram instead.

WARNING:tensorflow:From train_ssd_network.py:308: The name tf.summary.scalar is deprecated. Please use tf.compat.v1.summary.scalar instead.

W1024 13:57:42.420833 140239494461312 module_wrapper.py:139] From train_ssd_network.py:308: The name tf.summary.scalar is deprecated. Please use tf.compat.v1.summary.scalar instead.

WARNING:tensorflow:From /content/gdrive/MyDrive/Training_SSD/SSD-1/tf_utils.py:105: The name tf.train.exponential_decay is deprecated. Please use tf.compat.v1.train.exponential_decay instead.

W1024 13:57:42.625701 140239494461312 module_wrapper.py:139] From /content/gdrive/MyDrive/Training_SSD/SSD-1/tf_utils.py:105: The name tf.train.exponential_decay is deprecated. Please use tf.compat.v1.train.exponential_decay instead.

WARNING:tensorflow:From /content/gdrive/MyDrive/Training_SSD/SSD-1/tf_utils.py:144: The name tf.train.AdamOptimizer is deprecated. Please use tf.compat.v1.train.AdamOptimizer instead.

W1024 13:57:42.629828 140239494461312 module_wrapper.py:139] From /content/gdrive/MyDrive/Training_SSD/SSD-1/tf_utils.py:144: The name tf.train.AdamOptimizer is deprecated. Please use tf.compat.v1.train.AdamOptimizer instead.

WARNING:tensorflow:From /content/gdrive/MyDrive/Training_SSD/SSD-1/tf_utils.py:245: The name tf.trainable_variables is deprecated. Please use tf.compat.v1.trainable_variables instead.

W1024 13:57:42.630920 140239494461312 module_wrapper.py:139] From /content/gdrive/MyDrive/Training_SSD/SSD-1/tf_utils.py:245: The name tf.trainable_variables is deprecated. Please use tf.compat.v1.trainable_variables instead.

WARNING:tensorflow:From train_ssd_network.py:367: The name tf.summary.merge is deprecated. Please use tf.compat.v1.summary.merge instead.

W1024 13:57:43.817304 140239494461312 module_wrapper.py:139] From train_ssd_network.py:367: The name tf.summary.merge is deprecated. Please use tf.compat.v1.summary.merge instead.

WARNING:tensorflow:From train_ssd_network.py:372: The name tf.GPUOptions is deprecated. Please use tf.compat.v1.GPUOptions instead.

W1024 13:57:43.820022 140239494461312 module_wrapper.py:139] From train_ssd_network.py:372: The name tf.GPUOptions is deprecated. Please use tf.compat.v1.GPUOptions instead.

WARNING:tensorflow:From train_ssd_network.py:373: The name tf.ConfigProto is deprecated. Please use tf.compat.v1.ConfigProto instead.

W1024 13:57:43.820249 140239494461312 module_wrapper.py:139] From train_ssd_network.py:373: The name tf.ConfigProto is deprecated. Please use tf.compat.v1.ConfigProto instead.

WARNING:tensorflow:From train_ssd_network.py:375: The name tf.train.Saver is deprecated. Please use tf.compat.v1.train.Saver instead.

W1024 13:57:43.820408 140239494461312 module_wrapper.py:139] From train_ssd_network.py:375: The name tf.train.Saver is deprecated. Please use tf.compat.v1.train.Saver instead.

WARNING:tensorflow:From /content/gdrive/MyDrive/Training_SSD/SSD-1/tf_utils.py:226: The name tf.gfile.IsDirectory is deprecated. Please use tf.io.gfile.isdir instead.

W1024 13:57:43.963253 140239494461312 module_wrapper.py:139] From /content/gdrive/MyDrive/Training_SSD/SSD-1/tf_utils.py:226: The name tf.gfile.IsDirectory is deprecated. Please use tf.io.gfile.isdir instead.

WARNING:tensorflow:From /content/gdrive/MyDrive/Training_SSD/SSD-1/tf_utils.py:230: The name tf.logging.info is deprecated. Please use tf.compat.v1.logging.info instead.

W1024 13:57:43.963784 140239494461312 module_wrapper.py:139] From /content/gdrive/MyDrive/Training_SSD/SSD-1/tf_utils.py:230: The name tf.logging.info is deprecated. Please use tf.compat.v1.logging.info instead.

INFO:tensorflow:Fine-tuning from /content/gdrive/MyDrive/Training_SSD/SSD-1/checkpoints/ssd_300_vgg.ckpt/ssd_300_vgg.ckpt. Ignoring missing vars: False
I1024 13:57:43.963922 140239494461312 tf_utils.py:230] Fine-tuning from /content/gdrive/MyDrive/Training_SSD/SSD-1/checkpoints/ssd_300_vgg.ckpt/ssd_300_vgg.ckpt. Ignoring missing vars: False
WARNING:tensorflow:From /usr/local/lib/python3.7/dist-packages/tensorflow_core/contrib/slim/python/slim/learning.py:742: Supervisor.__init__ (from tensorflow.python.training.supervisor) is deprecated and will be removed in a future version.
Instructions for updating:
Please switch to tf.train.MonitoredTrainingSession
W1024 13:57:44.120857 140239494461312 deprecation.py:323] From /usr/local/lib/python3.7/dist-packages/tensorflow_core/contrib/slim/python/slim/learning.py:742: Supervisor.__init__ (from tensorflow.python.training.supervisor) is deprecated and will be removed in a future version.
Instructions for updating:
Please switch to tf.train.MonitoredTrainingSession
2021-10-24 13:57:44.436826: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 AVX512F FMA
2021-10-24 13:57:44.440876: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2000170000 Hz
2021-10-24 13:57:44.441070: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x56449f9cb9c0 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2021-10-24 13:57:44.441100: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Host, Default Version
2021-10-24 13:57:44.442817: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcuda.so.1
2021-10-24 13:57:44.554870: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-10-24 13:57:44.555802: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x56449f9cb640 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
2021-10-24 13:57:44.555833: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Tesla P100-PCIE-16GB, Compute Capability 6.0
2021-10-24 13:57:44.556006: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-10-24 13:57:44.556564: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 0 with properties: 
name: Tesla P100-PCIE-16GB major: 6 minor: 0 memoryClockRate(GHz): 1.3285
pciBusID: 0000:00:04.0
2021-10-24 13:57:44.556867: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
2021-10-24 13:57:44.558049: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
2021-10-24 13:57:44.559113: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10.0
2021-10-24 13:57:44.559464: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10.0
2021-10-24 13:57:44.560805: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10.0
2021-10-24 13:57:44.561773: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10.0
2021-10-24 13:57:44.564919: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2021-10-24 13:57:44.565038: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-10-24 13:57:44.565658: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-10-24 13:57:44.566169: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1746] Adding visible gpu devices: 0
2021-10-24 13:57:44.566234: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
2021-10-24 13:57:44.567361: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1159] Device interconnect StreamExecutor with strength 1 edge matrix:
2021-10-24 13:57:44.567389: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1165]      0 
2021-10-24 13:57:44.567399: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1178] 0:   N 
2021-10-24 13:57:44.567544: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-10-24 13:57:44.568127: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-10-24 13:57:44.568645: W tensorflow/core/common_runtime/gpu/gpu_bfc_allocator.cc:39] Overriding allow_growth setting because the TF_FORCE_GPU_ALLOW_GROWTH environment variable is set. Original config value was 0.
2021-10-24 13:57:44.568687: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1304] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 14652 MB memory) -> physical GPU (device: 0, name: Tesla P100-PCIE-16GB, pci bus id: 0000:00:04.0, compute capability: 6.0)
INFO:tensorflow:Restoring parameters from /content/gdrive/MyDrive/Training_SSD/SSD-1/checkpoints/ssd_300_vgg.ckpt/ssd_300_vgg.ckpt
I1024 13:57:45.783673 140239494461312 saver.py:1284] Restoring parameters from /content/gdrive/MyDrive/Training_SSD/SSD-1/checkpoints/ssd_300_vgg.ckpt/ssd_300_vgg.ckpt
INFO:tensorflow:Running local_init_op.
I1024 13:57:46.017776 140239494461312 session_manager.py:500] Running local_init_op.
INFO:tensorflow:Done running local_init_op.
I1024 13:57:46.075058 140239494461312 session_manager.py:502] Done running local_init_op.
INFO:tensorflow:Starting Session.
I1024 13:57:47.806292 140239494461312 learning.py:754] Starting Session.
INFO:tensorflow:Saving checkpoint to path /content/gdrive/MyDrive/Training_SSD/SSD-1/log_30000/model.ckpt
I1024 13:57:47.882676 140237141432064 supervisor.py:1117] Saving checkpoint to path /content/gdrive/MyDrive/Training_SSD/SSD-1/log_30000/model.ckpt
INFO:tensorflow:Starting Queues.
I1024 13:57:47.896139 140239494461312 learning.py:768] Starting Queues.
INFO:tensorflow:global_step/sec: 0
I1024 13:57:51.071433 140237149824768 supervisor.py:1099] global_step/sec: 0
2021-10-24 13:57:51.662253: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2021-10-24 13:57:52.787163: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
INFO:tensorflow:Recording summary at step 1.
I1024 13:57:55.259541 140235672987392 supervisor.py:1050] Recording summary at step 1.
INFO:tensorflow:global step 10: loss = 9.6467 (0.115 sec/step)
I1024 13:57:56.277639 140239494461312 learning.py:507] global step 10: loss = 9.6467 (0.115 sec/step)
INFO:tensorflow:global step 20: loss = 0.7245 (0.106 sec/step)
I1024 13:57:57.399851 140239494461312 learning.py:507] global step 20: loss = 0.7245 (0.106 sec/step)
INFO:tensorflow:global step 30: loss = 9.5159 (0.109 sec/step)
I1024 13:57:58.544558 140239494461312 learning.py:507] global step 30: loss = 9.5159 (0.109 sec/step)
INFO:tensorflow:global step 40: loss = 0.6637 (0.106 sec/step)
I1024 13:57:59.686780 140239494461312 learning.py:507] global step 40: loss = 0.6637 (0.106 sec/step)
INFO:tensorflow:global step 50: loss = 0.7424 (0.140 sec/step)
I1024 13:58:00.898716 140239494461312 learning.py:507] global step 50: loss = 0.7424 (0.140 sec/step)
INFO:tensorflow:global step 60: loss = 21.9683 (0.141 sec/step)
I1024 13:58:02.276094 140239494461312 learning.py:507] global step 60: loss = 21.9683 (0.141 sec/step)
INFO:tensorflow:global step 70: loss = 0.6486 (0.132 sec/step)
I1024 13:58:03.593588 140239494461312 learning.py:507] global step 70: loss = 0.6486 (0.132 sec/step)
INFO:tensorflow:global step 80: loss = 9.6484 (0.135 sec/step)
I1024 13:58:04.992696 140239494461312 learning.py:507] global step 80: loss = 9.6484 (0.135 sec/step)
INFO:tensorflow:global step 90: loss = 0.6877 (0.114 sec/step)
I1024 13:58:06.135541 140239494461312 learning.py:507] global step 90: loss = 0.6877 (0.114 sec/step)
INFO:tensorflow:global step 100: loss = 4.4349 (0.116 sec/step)
I1024 13:58:07.301742 140239494461312 learning.py:507] global step 100: loss = 4.4349 (0.116 sec/step)

There are a few things you can check when the loss fluctuates and does not converge to the desired minimum in an object-detection setup.

I am assuming you are fine-tuning on a specific dataset with your own custom objects/classes.

  1. Make sure your bounding boxes fit the objects tightly. From past experience, this is a key requirement for proper training. Spend as much time as needed to build a solid dataset with good annotations.
  2. How large are the bounding boxes compared to the image? Boxes that are too large or too small can be hard to detect, depending on the chosen hyperparameters; my workaround is to change the aspect ratios and scales of the model's box generator (a rough sketch of this follows the list).
  3. Is your dataset similar across all frames? Do you have roughly the same number of bounding boxes per frame on average? High variability in the dataset makes your model prone to weak training. If this is your case, consider either enlarging the dataset with more data to reduce the variability, or keeping only similar frames by removing the "outliers".
  4. Does your number of classes/objects correspond to the number of classes configured in the model? If you only have 10 classes and the model is built for 100, this could be causing your situation.
  5. Check your learning rate. For a pre-trained model, start with a very small learning rate (1e-4 or 1e-5). If the learning rate is large, your model may keep bouncing around the same local minima (unstable loss). Also try a scheduler that lowers the learning rate every X steps (a sketch of this also follows the list).
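
Regarding point 2: below is a rough sketch, not the repo's actual code, of how SSD anchor sizes in pixels are usually derived from relative scale bounds [s_min, s_max], following the SSD paper. If I remember correctly, the corresponding knobs in the balancap repo are anchor_size_bounds, anchor_sizes and anchor_ratios in the SSDParams defaults of nets/ssd_vgg_300.py; the exact default values there differ from this simplified calculation, which is only meant to show the relationship.

# Rough sketch (illustrative, not the repo's code): SSD anchor sizes in pixels
# from relative scale bounds. Shrinking s_min (e.g. 0.15 -> 0.07) makes every
# default box smaller, which helps when objects are small relative to the
# 300x300 input.
def ssd_anchor_sizes(img_size=300, size_bounds=(0.15, 0.90), num_layers=6):
    s_min, s_max = size_bounds
    # One relative scale per feature layer, linearly spaced between the bounds.
    scales = [s_min + (s_max - s_min) * k / (num_layers - 1) for k in range(num_layers)]
    scales.append(1.0)  # upper bound for the last layer's "big" box
    # Each layer gets a (min_size, max_size) pair in pixels.
    return [(img_size * scales[k], img_size * scales[k + 1]) for k in range(num_layers)]

print(ssd_anchor_sizes())                          # default-ish bounds
print(ssd_anchor_sizes(size_bounds=(0.07, 0.80)))  # smaller anchors for small objects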
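
Regarding point 5: here is a minimal sketch of the exponential-decay schedule the repo already builds (the warnings in your log show tf_utils.py calling tf.train.exponential_decay and tf.train.AdamOptimizer). The initial learning rate and decay_steps below are illustrative fine-tuning values, not the repo's defaults, and the snippet assumes the repo's TF 1.x environment.

import tensorflow as tf  # TF 1.x, as in the repo's environment

global_step = tf.train.get_or_create_global_step()
learning_rate = tf.train.exponential_decay(
    learning_rate=1e-4,   # small starting LR for a pre-trained backbone (point 5)
    global_step=global_step,
    decay_steps=5000,     # lower the LR every ~5000 steps; tune to your dataset size
    decay_rate=0.94,      # matches your --learning_rate_decay_factor=0.94
    staircase=True)
optimizer = tf.train.AdamOptimizer(learning_rate)

With the repo's own training script, the simplest equivalent is to lower --learning_rate (for example 0.0001 instead of 0.01) while keeping --learning_rate_decay_factor=0.94 as you already do.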

Also keep in mind that your dataset is "small", which makes training harder and sometimes calls for a lot of hyperparameter tuning; even so, you should be able to reach a higher mAP and mAR and a lower loss with some of the default configurations.

Extra tip: that repo is a bit old, so consider taking a look at the TensorFlow Object Detection API instead. It offers a much wider range of models, including SOTA ones, and its configuration is very similar to what you are already doing.
