
Unstable loss SSD-300 due to which mAP is low

I am training my SSD-300 model and have resized my images to 300x300. I am using the default settings from this GitHub repo: https://github.com/balancap/SSD-Tensorflow

The loss is unstable during training. I have trained the model for 50,000 training steps, and the current mAP I get is 0.26 (VOC 2007) and 0.24 (VOC 2012).

Training set: 1500 images. Test set: 300 images.

Current parameters:

!python train_ssd_network.py --dataset_name=pascalvoc_2007 --dataset_split_name=train --model_name=ssd_300_vgg --save_summaries_secs=60 --save_interval_secs=600 --weight_decay=0.00004 --optimizer=adam --learning_rate=0.01 --batch_size=2 --gpu_memory_fraction=0.9 --learning_rate_decay_factor=0.94 --num_classes=3 --checkpoint_exclude_scopes=ssd_300_vgg/conv6,ssd_300_vgg/conv7,ssd_300_vgg/block8,ssd_300_vgg/block9,ssd_300_vgg/block10,ssd_300_vgg/block11,ssd_300_vgg/block4_box,ssd_300_vgg/block7_box,ssd_300_vgg/block8_box,ssd_300_vgg/block9_box,ssd_300_vgg/block10_box,ssd_300_vgg/block11_box --eval_training_data=True

What can I do to get a good accuracy (mAP)?

Example of the loss; it even reaches values as high as 80:

W1024 13:57:41.660651 140239494461312 deprecation.py:323] From train_ssd_network.py:256: batch (from tensorflow.python.training.input) is deprecated and will be removed in a future version.
Instructions for updating:
Queue-based input pipelines have been replaced by `tf.data`. Use `tf.data.Dataset.batch(batch_size)` (or `padded_batch(...)` if `dynamic_pad=True`).
WARNING:tensorflow:From train_ssd_network.py:292: The name tf.get_collection is deprecated. Please use tf.compat.v1.get_collection instead.

W1024 13:57:41.676577 140239494461312 module_wrapper.py:139] From train_ssd_network.py:292: The name tf.get_collection is deprecated. Please use tf.compat.v1.get_collection instead.

WARNING:tensorflow:From train_ssd_network.py:292: The name tf.GraphKeys is deprecated. Please use tf.compat.v1.GraphKeys instead.

W1024 13:57:41.676797 140239494461312 module_wrapper.py:139] From train_ssd_network.py:292: The name tf.GraphKeys is deprecated. Please use tf.compat.v1.GraphKeys instead.

WARNING:tensorflow:From /content/gdrive/MyDrive/Training_SSD/SSD-1/deployment/model_deploy.py:194: The name tf.variable_scope is deprecated. Please use tf.compat.v1.variable_scope instead.

W1024 13:57:41.677163 140239494461312 module_wrapper.py:139] From /content/gdrive/MyDrive/Training_SSD/SSD-1/deployment/model_deploy.py:194: The name tf.variable_scope is deprecated. Please use tf.compat.v1.variable_scope instead.

WARNING:tensorflow:From /content/gdrive/MyDrive/Training_SSD/SSD-1/deployment/model_deploy.py:194: The name tf.get_variable_scope is deprecated. Please use tf.compat.v1.get_variable_scope instead.

W1024 13:57:41.677324 140239494461312 module_wrapper.py:139] From /content/gdrive/MyDrive/Training_SSD/SSD-1/deployment/model_deploy.py:194: The name tf.get_variable_scope is deprecated. Please use tf.compat.v1.get_variable_scope instead.

WARNING:tensorflow:From /usr/local/lib/python3.7/dist-packages/tensorflow_core/contrib/layers/python/layers/layers.py:1057: Layer.apply (from tensorflow.python.keras.engine.base_layer) is deprecated and will be removed in a future version.
Instructions for updating:
Please use `layer.__call__` method instead.
W1024 13:57:41.679852 140239494461312 deprecation.py:323] From /usr/local/lib/python3.7/dist-packages/tensorflow_core/contrib/layers/python/layers/layers.py:1057: Layer.apply (from tensorflow.python.keras.engine.base_layer) is deprecated and will be removed in a future version.
Instructions for updating:
Please use `layer.__call__` method instead.
WARNING:tensorflow:From /content/gdrive/MyDrive/Training_SSD/SSD-1/nets/ssd_vgg_300.py:476: dropout (from tensorflow.python.layers.core) is deprecated and will be removed in a future version.
Instructions for updating:
Use keras.layers.dropout instead.
W1024 13:57:41.998192 140239494461312 deprecation.py:323] From /content/gdrive/MyDrive/Training_SSD/SSD-1/nets/ssd_vgg_300.py:476: dropout (from tensorflow.python.layers.core) is deprecated and will be removed in a future version.
Instructions for updating:
Use keras.layers.dropout instead.
WARNING:tensorflow:From /content/gdrive/MyDrive/Training_SSD/SSD-1/nets/ssd_vgg_300.py:642: The name tf.losses.add_loss is deprecated. Please use tf.compat.v1.losses.add_loss instead.

W1024 13:57:42.408573 140239494461312 module_wrapper.py:139] From /content/gdrive/MyDrive/Training_SSD/SSD-1/nets/ssd_vgg_300.py:642: The name tf.losses.add_loss is deprecated. Please use tf.compat.v1.losses.add_loss instead.

WARNING:tensorflow:From train_ssd_network.py:307: The name tf.summary.histogram is deprecated. Please use tf.compat.v1.summary.histogram instead.

W1024 13:57:42.419716 140239494461312 module_wrapper.py:139] From train_ssd_network.py:307: The name tf.summary.histogram is deprecated. Please use tf.compat.v1.summary.histogram instead.

WARNING:tensorflow:From train_ssd_network.py:308: The name tf.summary.scalar is deprecated. Please use tf.compat.v1.summary.scalar instead.

W1024 13:57:42.420833 140239494461312 module_wrapper.py:139] From train_ssd_network.py:308: The name tf.summary.scalar is deprecated. Please use tf.compat.v1.summary.scalar instead.

WARNING:tensorflow:From /content/gdrive/MyDrive/Training_SSD/SSD-1/tf_utils.py:105: The name tf.train.exponential_decay is deprecated. Please use tf.compat.v1.train.exponential_decay instead.

W1024 13:57:42.625701 140239494461312 module_wrapper.py:139] From /content/gdrive/MyDrive/Training_SSD/SSD-1/tf_utils.py:105: The name tf.train.exponential_decay is deprecated. Please use tf.compat.v1.train.exponential_decay instead.

WARNING:tensorflow:From /content/gdrive/MyDrive/Training_SSD/SSD-1/tf_utils.py:144: The name tf.train.AdamOptimizer is deprecated. Please use tf.compat.v1.train.AdamOptimizer instead.

W1024 13:57:42.629828 140239494461312 module_wrapper.py:139] From /content/gdrive/MyDrive/Training_SSD/SSD-1/tf_utils.py:144: The name tf.train.AdamOptimizer is deprecated. Please use tf.compat.v1.train.AdamOptimizer instead.

WARNING:tensorflow:From /content/gdrive/MyDrive/Training_SSD/SSD-1/tf_utils.py:245: The name tf.trainable_variables is deprecated. Please use tf.compat.v1.trainable_variables instead.

W1024 13:57:42.630920 140239494461312 module_wrapper.py:139] From /content/gdrive/MyDrive/Training_SSD/SSD-1/tf_utils.py:245: The name tf.trainable_variables is deprecated. Please use tf.compat.v1.trainable_variables instead.

WARNING:tensorflow:From train_ssd_network.py:367: The name tf.summary.merge is deprecated. Please use tf.compat.v1.summary.merge instead.

W1024 13:57:43.817304 140239494461312 module_wrapper.py:139] From train_ssd_network.py:367: The name tf.summary.merge is deprecated. Please use tf.compat.v1.summary.merge instead.

WARNING:tensorflow:From train_ssd_network.py:372: The name tf.GPUOptions is deprecated. Please use tf.compat.v1.GPUOptions instead.

W1024 13:57:43.820022 140239494461312 module_wrapper.py:139] From train_ssd_network.py:372: The name tf.GPUOptions is deprecated. Please use tf.compat.v1.GPUOptions instead.

WARNING:tensorflow:From train_ssd_network.py:373: The name tf.ConfigProto is deprecated. Please use tf.compat.v1.ConfigProto instead.

W1024 13:57:43.820249 140239494461312 module_wrapper.py:139] From train_ssd_network.py:373: The name tf.ConfigProto is deprecated. Please use tf.compat.v1.ConfigProto instead.

WARNING:tensorflow:From train_ssd_network.py:375: The name tf.train.Saver is deprecated. Please use tf.compat.v1.train.Saver instead.

W1024 13:57:43.820408 140239494461312 module_wrapper.py:139] From train_ssd_network.py:375: The name tf.train.Saver is deprecated. Please use tf.compat.v1.train.Saver instead.

WARNING:tensorflow:From /content/gdrive/MyDrive/Training_SSD/SSD-1/tf_utils.py:226: The name tf.gfile.IsDirectory is deprecated. Please use tf.io.gfile.isdir instead.

W1024 13:57:43.963253 140239494461312 module_wrapper.py:139] From /content/gdrive/MyDrive/Training_SSD/SSD-1/tf_utils.py:226: The name tf.gfile.IsDirectory is deprecated. Please use tf.io.gfile.isdir instead.

WARNING:tensorflow:From /content/gdrive/MyDrive/Training_SSD/SSD-1/tf_utils.py:230: The name tf.logging.info is deprecated. Please use tf.compat.v1.logging.info instead.

W1024 13:57:43.963784 140239494461312 module_wrapper.py:139] From /content/gdrive/MyDrive/Training_SSD/SSD-1/tf_utils.py:230: The name tf.logging.info is deprecated. Please use tf.compat.v1.logging.info instead.

INFO:tensorflow:Fine-tuning from /content/gdrive/MyDrive/Training_SSD/SSD-1/checkpoints/ssd_300_vgg.ckpt/ssd_300_vgg.ckpt. Ignoring missing vars: False
I1024 13:57:43.963922 140239494461312 tf_utils.py:230] Fine-tuning from /content/gdrive/MyDrive/Training_SSD/SSD-1/checkpoints/ssd_300_vgg.ckpt/ssd_300_vgg.ckpt. Ignoring missing vars: False
WARNING:tensorflow:From /usr/local/lib/python3.7/dist-packages/tensorflow_core/contrib/slim/python/slim/learning.py:742: Supervisor.__init__ (from tensorflow.python.training.supervisor) is deprecated and will be removed in a future version.
Instructions for updating:
Please switch to tf.train.MonitoredTrainingSession
W1024 13:57:44.120857 140239494461312 deprecation.py:323] From /usr/local/lib/python3.7/dist-packages/tensorflow_core/contrib/slim/python/slim/learning.py:742: Supervisor.__init__ (from tensorflow.python.training.supervisor) is deprecated and will be removed in a future version.
Instructions for updating:
Please switch to tf.train.MonitoredTrainingSession
2021-10-24 13:57:44.436826: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 AVX512F FMA
2021-10-24 13:57:44.440876: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2000170000 Hz
2021-10-24 13:57:44.441070: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x56449f9cb9c0 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2021-10-24 13:57:44.441100: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Host, Default Version
2021-10-24 13:57:44.442817: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcuda.so.1
2021-10-24 13:57:44.554870: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-10-24 13:57:44.555802: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x56449f9cb640 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
2021-10-24 13:57:44.555833: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Tesla P100-PCIE-16GB, Compute Capability 6.0
2021-10-24 13:57:44.556006: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-10-24 13:57:44.556564: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 0 with properties: 
name: Tesla P100-PCIE-16GB major: 6 minor: 0 memoryClockRate(GHz): 1.3285
pciBusID: 0000:00:04.0
2021-10-24 13:57:44.556867: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
2021-10-24 13:57:44.558049: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
2021-10-24 13:57:44.559113: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10.0
2021-10-24 13:57:44.559464: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10.0
2021-10-24 13:57:44.560805: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10.0
2021-10-24 13:57:44.561773: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10.0
2021-10-24 13:57:44.564919: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2021-10-24 13:57:44.565038: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-10-24 13:57:44.565658: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-10-24 13:57:44.566169: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1746] Adding visible gpu devices: 0
2021-10-24 13:57:44.566234: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
2021-10-24 13:57:44.567361: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1159] Device interconnect StreamExecutor with strength 1 edge matrix:
2021-10-24 13:57:44.567389: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1165]      0 
2021-10-24 13:57:44.567399: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1178] 0:   N 
2021-10-24 13:57:44.567544: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-10-24 13:57:44.568127: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-10-24 13:57:44.568645: W tensorflow/core/common_runtime/gpu/gpu_bfc_allocator.cc:39] Overriding allow_growth setting because the TF_FORCE_GPU_ALLOW_GROWTH environment variable is set. Original config value was 0.
2021-10-24 13:57:44.568687: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1304] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 14652 MB memory) -> physical GPU (device: 0, name: Tesla P100-PCIE-16GB, pci bus id: 0000:00:04.0, compute capability: 6.0)
INFO:tensorflow:Restoring parameters from /content/gdrive/MyDrive/Training_SSD/SSD-1/checkpoints/ssd_300_vgg.ckpt/ssd_300_vgg.ckpt
I1024 13:57:45.783673 140239494461312 saver.py:1284] Restoring parameters from /content/gdrive/MyDrive/Training_SSD/SSD-1/checkpoints/ssd_300_vgg.ckpt/ssd_300_vgg.ckpt
INFO:tensorflow:Running local_init_op.
I1024 13:57:46.017776 140239494461312 session_manager.py:500] Running local_init_op.
INFO:tensorflow:Done running local_init_op.
I1024 13:57:46.075058 140239494461312 session_manager.py:502] Done running local_init_op.
INFO:tensorflow:Starting Session.
I1024 13:57:47.806292 140239494461312 learning.py:754] Starting Session.
INFO:tensorflow:Saving checkpoint to path /content/gdrive/MyDrive/Training_SSD/SSD-1/log_30000/model.ckpt
I1024 13:57:47.882676 140237141432064 supervisor.py:1117] Saving checkpoint to path /content/gdrive/MyDrive/Training_SSD/SSD-1/log_30000/model.ckpt
INFO:tensorflow:Starting Queues.
I1024 13:57:47.896139 140239494461312 learning.py:768] Starting Queues.
INFO:tensorflow:global_step/sec: 0
I1024 13:57:51.071433 140237149824768 supervisor.py:1099] global_step/sec: 0
2021-10-24 13:57:51.662253: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2021-10-24 13:57:52.787163: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
INFO:tensorflow:Recording summary at step 1.
I1024 13:57:55.259541 140235672987392 supervisor.py:1050] Recording summary at step 1.
INFO:tensorflow:global step 10: loss = 9.6467 (0.115 sec/step)
I1024 13:57:56.277639 140239494461312 learning.py:507] global step 10: loss = 9.6467 (0.115 sec/step)
INFO:tensorflow:global step 20: loss = 0.7245 (0.106 sec/step)
I1024 13:57:57.399851 140239494461312 learning.py:507] global step 20: loss = 0.7245 (0.106 sec/step)
INFO:tensorflow:global step 30: loss = 9.5159 (0.109 sec/step)
I1024 13:57:58.544558 140239494461312 learning.py:507] global step 30: loss = 9.5159 (0.109 sec/step)
INFO:tensorflow:global step 40: loss = 0.6637 (0.106 sec/step)
I1024 13:57:59.686780 140239494461312 learning.py:507] global step 40: loss = 0.6637 (0.106 sec/step)
INFO:tensorflow:global step 50: loss = 0.7424 (0.140 sec/step)
I1024 13:58:00.898716 140239494461312 learning.py:507] global step 50: loss = 0.7424 (0.140 sec/step)
INFO:tensorflow:global step 60: loss = 21.9683 (0.141 sec/step)
I1024 13:58:02.276094 140239494461312 learning.py:507] global step 60: loss = 21.9683 (0.141 sec/step)
INFO:tensorflow:global step 70: loss = 0.6486 (0.132 sec/step)
I1024 13:58:03.593588 140239494461312 learning.py:507] global step 70: loss = 0.6486 (0.132 sec/step)
INFO:tensorflow:global step 80: loss = 9.6484 (0.135 sec/step)
I1024 13:58:04.992696 140239494461312 learning.py:507] global step 80: loss = 9.6484 (0.135 sec/step)
INFO:tensorflow:global step 90: loss = 0.6877 (0.114 sec/step)
I1024 13:58:06.135541 140239494461312 learning.py:507] global step 90: loss = 0.6877 (0.114 sec/step)
INFO:tensorflow:global step 100: loss = 4.4349 (0.116 sec/step)
I1024 13:58:07.301742 140239494461312 learning.py:507] global step 100: loss = 4.4349 (0.116 sec/step)

There are a few things you can check when the loss fluctuates and does not converge to the desired minimum in an object-detection setup.

I am assuming you are fine-tuning on a specific dataset with your own custom objects/classes.

  1. Make sure your bounding boxes fit the objects tightly. From past experience, this is a key requirement for proper training. Spend as much time as needed to build a solid dataset with good annotations.
  2. How large are the bounding boxes compared to the image? Boxes that are too large or too small can be hard to detect, depending on the chosen hyperparameters; my workaround is to change the aspect ratios and scales of the model's box generator (a rough sketch of this follows the list).
  3. Is your dataset similar across all frames? Do you have roughly the same number of bounding boxes per frame on average? High variability in the dataset makes your model prone to weak training. If this is your case, consider either enlarging the dataset with more data to reduce the variability, or keeping only similar frames by removing the "outliers".
  4. Does your number of classes/objects correspond to the number of classes configured in the model? If you only have 10 classes and the model is built for 100, this could be causing your situation.
  5. Check your learning rate. For a pre-trained model, start with a very small learning rate (1e-4 or 1e-5). If the learning rate is large, your model may keep bouncing around the same local minima (unstable loss). Also try a scheduler that lowers the learning rate every X steps (a sketch of this also follows the list).
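
Regarding point 2: below is a rough sketch, not the repo's actual code, of how SSD anchor sizes in pixels are usually derived from relative scale bounds [s_min, s_max], following the SSD paper. If I remember correctly, the corresponding knobs in the balancap repo are anchor_size_bounds, anchor_sizes and anchor_ratios in the SSDParams defaults of nets/ssd_vgg_300.py; the exact default values there differ from this simplified calculation, which is only meant to show the relationship.

# Rough sketch (illustrative, not the repo's code): SSD anchor sizes in pixels
# from relative scale bounds. Shrinking s_min (e.g. 0.15 -> 0.07) makes every
# default box smaller, which helps when objects are small relative to the
# 300x300 input.
def ssd_anchor_sizes(img_size=300, size_bounds=(0.15, 0.90), num_layers=6):
    s_min, s_max = size_bounds
    # One relative scale per feature layer, linearly spaced between the bounds.
    scales = [s_min + (s_max - s_min) * k / (num_layers - 1) for k in range(num_layers)]
    scales.append(1.0)  # upper bound for the last layer's "big" box
    # Each layer gets a (min_size, max_size) pair in pixels.
    return [(img_size * scales[k], img_size * scales[k + 1]) for k in range(num_layers)]

print(ssd_anchor_sizes())                          # default-ish bounds
print(ssd_anchor_sizes(size_bounds=(0.07, 0.80)))  # smaller anchors for small objects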
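
Regarding point 5: here is a minimal sketch of the exponential-decay schedule the repo already builds (the warnings in your log show tf_utils.py calling tf.train.exponential_decay and tf.train.AdamOptimizer). The initial learning rate and decay_steps below are illustrative fine-tuning values, not the repo's defaults, and the snippet assumes the repo's TF 1.x environment.

import tensorflow as tf  # TF 1.x, as in the repo's environment

global_step = tf.train.get_or_create_global_step()
learning_rate = tf.train.exponential_decay(
    learning_rate=1e-4,   # small starting LR for a pre-trained backbone (point 5)
    global_step=global_step,
    decay_steps=5000,     # lower the LR every ~5000 steps; tune to your dataset size
    decay_rate=0.94,      # matches your --learning_rate_decay_factor=0.94
    staircase=True)
optimizer = tf.train.AdamOptimizer(learning_rate)

With the repo's own training script, the simplest equivalent is to lower --learning_rate (for example 0.0001 instead of 0.01) while keeping --learning_rate_decay_factor=0.94 as you already do.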

Also keep in mind that your dataset is "small", which makes training harder and sometimes calls for a lot of hyperparameter tuning; even so, you should be able to reach a higher mAP and mAR and a lower loss with some of the default configurations.

Extra tip: that repo is a bit old, so consider taking a look at the TensorFlow Object Detection API instead. It offers a much wider range of models, including SOTA ones, and its configuration is very similar to what you are already doing.
