简体   繁体   English

由于mAP低,SSD-300不稳定损失

[英]Unstable loss SSD-300 due to which mAP is low

I am training my SSD-300 model for which I have resized images to 300x300.我正在训练我的 SSD-300 模型,我已将图像大小调整为 300x300。 I am using the default settings as mentioned in github repo: https://github.com/balancap/SSD-Tensorflow我正在使用 github repo 中提到的默认设置: https : //github.com/balancap/SSD-Tensorflow

The loss is unstable while training.训练时损失不稳定。 I tried training it till 50,000 training steps.我尝试将其训练到 50,000 个训练步骤。 The current mAP that I am getting is 0.26(VOC 2007) and 0.24 (VOC 2012)我得到的当前 mAP 是 0.26(VOC 2007)和 0.24(VOC 2012)

Train set: 1500 images Test: 300 images训练集:1500 张图片测试:300 张图片

Current parameters:当前参数:

!python train_ssd_network.py --dataset_name=pascalvoc_2007 --dataset_split_name=train --model_name=ssd_300_vgg --save_summaries_secs=60 --save_interval_secs=600 --weight_decay=0.00004 --optimizer=adam --learning_rate=0.01 --batch_size=2 --gpu_memory_fraction=0.9 --learning_rate_decay_factor=0.94 -num_classes=3  --checkpoint_exclude_scopes =ssd_300_vgg/conv6,ssd_300_vgg/conv7,ssd_300_vgg/block8,ssd_300_vgg/block9,ssd_300_vgg/block10,ssd_300_vgg/block11,ssd_300_vgg/block4_box,ssd_300_vgg/block7_box,ssd_300_vgg/block8_box,ssd_300_vgg/block9_box,ssd_300_vgg/block10_box,ssd_300_vgg/block11_box --eval_training_data=True

What can I do so that I get a good accuracy (mAP)?我该怎么做才能获得良好的准确度 (mAP)?

Example of loss, the loss even reached to 80:损失的例子,损失甚至达到了80:

W1024 13:57:41.660651 140239494461312 deprecation.py:323] From train_ssd_network.py:256: batch (from tensorflow.python.training.input) is deprecated and will be removed in a future version.
Instructions for updating:
Queue-based input pipelines have been replaced by `tf.data`. Use `tf.data.Dataset.batch(batch_size)` (or `padded_batch(...)` if `dynamic_pad=True`).
WARNING:tensorflow:From train_ssd_network.py:292: The name tf.get_collection is deprecated. Please use tf.compat.v1.get_collection instead.

W1024 13:57:41.676577 140239494461312 module_wrapper.py:139] From train_ssd_network.py:292: The name tf.get_collection is deprecated. Please use tf.compat.v1.get_collection instead.

WARNING:tensorflow:From train_ssd_network.py:292: The name tf.GraphKeys is deprecated. Please use tf.compat.v1.GraphKeys instead.

W1024 13:57:41.676797 140239494461312 module_wrapper.py:139] From train_ssd_network.py:292: The name tf.GraphKeys is deprecated. Please use tf.compat.v1.GraphKeys instead.

WARNING:tensorflow:From /content/gdrive/MyDrive/Training_SSD/SSD-1/deployment/model_deploy.py:194: The name tf.variable_scope is deprecated. Please use tf.compat.v1.variable_scope instead.

W1024 13:57:41.677163 140239494461312 module_wrapper.py:139] From /content/gdrive/MyDrive/Training_SSD/SSD-1/deployment/model_deploy.py:194: The name tf.variable_scope is deprecated. Please use tf.compat.v1.variable_scope instead.

WARNING:tensorflow:From /content/gdrive/MyDrive/Training_SSD/SSD-1/deployment/model_deploy.py:194: The name tf.get_variable_scope is deprecated. Please use tf.compat.v1.get_variable_scope instead.

W1024 13:57:41.677324 140239494461312 module_wrapper.py:139] From /content/gdrive/MyDrive/Training_SSD/SSD-1/deployment/model_deploy.py:194: The name tf.get_variable_scope is deprecated. Please use tf.compat.v1.get_variable_scope instead.

WARNING:tensorflow:From /usr/local/lib/python3.7/dist-packages/tensorflow_core/contrib/layers/python/layers/layers.py:1057: Layer.apply (from tensorflow.python.keras.engine.base_layer) is deprecated and will be removed in a future version.
Instructions for updating:
Please use `layer.__call__` method instead.
W1024 13:57:41.679852 140239494461312 deprecation.py:323] From /usr/local/lib/python3.7/dist-packages/tensorflow_core/contrib/layers/python/layers/layers.py:1057: Layer.apply (from tensorflow.python.keras.engine.base_layer) is deprecated and will be removed in a future version.
Instructions for updating:
Please use `layer.__call__` method instead.
WARNING:tensorflow:From /content/gdrive/MyDrive/Training_SSD/SSD-1/nets/ssd_vgg_300.py:476: dropout (from tensorflow.python.layers.core) is deprecated and will be removed in a future version.
Instructions for updating:
Use keras.layers.dropout instead.
W1024 13:57:41.998192 140239494461312 deprecation.py:323] From /content/gdrive/MyDrive/Training_SSD/SSD-1/nets/ssd_vgg_300.py:476: dropout (from tensorflow.python.layers.core) is deprecated and will be removed in a future version.
Instructions for updating:
Use keras.layers.dropout instead.
WARNING:tensorflow:From /content/gdrive/MyDrive/Training_SSD/SSD-1/nets/ssd_vgg_300.py:642: The name tf.losses.add_loss is deprecated. Please use tf.compat.v1.losses.add_loss instead.

W1024 13:57:42.408573 140239494461312 module_wrapper.py:139] From /content/gdrive/MyDrive/Training_SSD/SSD-1/nets/ssd_vgg_300.py:642: The name tf.losses.add_loss is deprecated. Please use tf.compat.v1.losses.add_loss instead.

WARNING:tensorflow:From train_ssd_network.py:307: The name tf.summary.histogram is deprecated. Please use tf.compat.v1.summary.histogram instead.

W1024 13:57:42.419716 140239494461312 module_wrapper.py:139] From train_ssd_network.py:307: The name tf.summary.histogram is deprecated. Please use tf.compat.v1.summary.histogram instead.

WARNING:tensorflow:From train_ssd_network.py:308: The name tf.summary.scalar is deprecated. Please use tf.compat.v1.summary.scalar instead.

W1024 13:57:42.420833 140239494461312 module_wrapper.py:139] From train_ssd_network.py:308: The name tf.summary.scalar is deprecated. Please use tf.compat.v1.summary.scalar instead.

WARNING:tensorflow:From /content/gdrive/MyDrive/Training_SSD/SSD-1/tf_utils.py:105: The name tf.train.exponential_decay is deprecated. Please use tf.compat.v1.train.exponential_decay instead.

W1024 13:57:42.625701 140239494461312 module_wrapper.py:139] From /content/gdrive/MyDrive/Training_SSD/SSD-1/tf_utils.py:105: The name tf.train.exponential_decay is deprecated. Please use tf.compat.v1.train.exponential_decay instead.

WARNING:tensorflow:From /content/gdrive/MyDrive/Training_SSD/SSD-1/tf_utils.py:144: The name tf.train.AdamOptimizer is deprecated. Please use tf.compat.v1.train.AdamOptimizer instead.

W1024 13:57:42.629828 140239494461312 module_wrapper.py:139] From /content/gdrive/MyDrive/Training_SSD/SSD-1/tf_utils.py:144: The name tf.train.AdamOptimizer is deprecated. Please use tf.compat.v1.train.AdamOptimizer instead.

WARNING:tensorflow:From /content/gdrive/MyDrive/Training_SSD/SSD-1/tf_utils.py:245: The name tf.trainable_variables is deprecated. Please use tf.compat.v1.trainable_variables instead.

W1024 13:57:42.630920 140239494461312 module_wrapper.py:139] From /content/gdrive/MyDrive/Training_SSD/SSD-1/tf_utils.py:245: The name tf.trainable_variables is deprecated. Please use tf.compat.v1.trainable_variables instead.

WARNING:tensorflow:From train_ssd_network.py:367: The name tf.summary.merge is deprecated. Please use tf.compat.v1.summary.merge instead.

W1024 13:57:43.817304 140239494461312 module_wrapper.py:139] From train_ssd_network.py:367: The name tf.summary.merge is deprecated. Please use tf.compat.v1.summary.merge instead.

WARNING:tensorflow:From train_ssd_network.py:372: The name tf.GPUOptions is deprecated. Please use tf.compat.v1.GPUOptions instead.

W1024 13:57:43.820022 140239494461312 module_wrapper.py:139] From train_ssd_network.py:372: The name tf.GPUOptions is deprecated. Please use tf.compat.v1.GPUOptions instead.

WARNING:tensorflow:From train_ssd_network.py:373: The name tf.ConfigProto is deprecated. Please use tf.compat.v1.ConfigProto instead.

W1024 13:57:43.820249 140239494461312 module_wrapper.py:139] From train_ssd_network.py:373: The name tf.ConfigProto is deprecated. Please use tf.compat.v1.ConfigProto instead.

WARNING:tensorflow:From train_ssd_network.py:375: The name tf.train.Saver is deprecated. Please use tf.compat.v1.train.Saver instead.

W1024 13:57:43.820408 140239494461312 module_wrapper.py:139] From train_ssd_network.py:375: The name tf.train.Saver is deprecated. Please use tf.compat.v1.train.Saver instead.

WARNING:tensorflow:From /content/gdrive/MyDrive/Training_SSD/SSD-1/tf_utils.py:226: The name tf.gfile.IsDirectory is deprecated. Please use tf.io.gfile.isdir instead.

W1024 13:57:43.963253 140239494461312 module_wrapper.py:139] From /content/gdrive/MyDrive/Training_SSD/SSD-1/tf_utils.py:226: The name tf.gfile.IsDirectory is deprecated. Please use tf.io.gfile.isdir instead.

WARNING:tensorflow:From /content/gdrive/MyDrive/Training_SSD/SSD-1/tf_utils.py:230: The name tf.logging.info is deprecated. Please use tf.compat.v1.logging.info instead.

W1024 13:57:43.963784 140239494461312 module_wrapper.py:139] From /content/gdrive/MyDrive/Training_SSD/SSD-1/tf_utils.py:230: The name tf.logging.info is deprecated. Please use tf.compat.v1.logging.info instead.

INFO:tensorflow:Fine-tuning from /content/gdrive/MyDrive/Training_SSD/SSD-1/checkpoints/ssd_300_vgg.ckpt/ssd_300_vgg.ckpt. Ignoring missing vars: False
I1024 13:57:43.963922 140239494461312 tf_utils.py:230] Fine-tuning from /content/gdrive/MyDrive/Training_SSD/SSD-1/checkpoints/ssd_300_vgg.ckpt/ssd_300_vgg.ckpt. Ignoring missing vars: False
WARNING:tensorflow:From /usr/local/lib/python3.7/dist-packages/tensorflow_core/contrib/slim/python/slim/learning.py:742: Supervisor.__init__ (from tensorflow.python.training.supervisor) is deprecated and will be removed in a future version.
Instructions for updating:
Please switch to tf.train.MonitoredTrainingSession
W1024 13:57:44.120857 140239494461312 deprecation.py:323] From /usr/local/lib/python3.7/dist-packages/tensorflow_core/contrib/slim/python/slim/learning.py:742: Supervisor.__init__ (from tensorflow.python.training.supervisor) is deprecated and will be removed in a future version.
Instructions for updating:
Please switch to tf.train.MonitoredTrainingSession
2021-10-24 13:57:44.436826: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 AVX512F FMA
2021-10-24 13:57:44.440876: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2000170000 Hz
2021-10-24 13:57:44.441070: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x56449f9cb9c0 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2021-10-24 13:57:44.441100: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Host, Default Version
2021-10-24 13:57:44.442817: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcuda.so.1
2021-10-24 13:57:44.554870: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-10-24 13:57:44.555802: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x56449f9cb640 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
2021-10-24 13:57:44.555833: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Tesla P100-PCIE-16GB, Compute Capability 6.0
2021-10-24 13:57:44.556006: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-10-24 13:57:44.556564: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 0 with properties: 
name: Tesla P100-PCIE-16GB major: 6 minor: 0 memoryClockRate(GHz): 1.3285
pciBusID: 0000:00:04.0
2021-10-24 13:57:44.556867: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
2021-10-24 13:57:44.558049: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
2021-10-24 13:57:44.559113: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10.0
2021-10-24 13:57:44.559464: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10.0
2021-10-24 13:57:44.560805: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10.0
2021-10-24 13:57:44.561773: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10.0
2021-10-24 13:57:44.564919: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2021-10-24 13:57:44.565038: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-10-24 13:57:44.565658: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-10-24 13:57:44.566169: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1746] Adding visible gpu devices: 0
2021-10-24 13:57:44.566234: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
2021-10-24 13:57:44.567361: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1159] Device interconnect StreamExecutor with strength 1 edge matrix:
2021-10-24 13:57:44.567389: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1165]      0 
2021-10-24 13:57:44.567399: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1178] 0:   N 
2021-10-24 13:57:44.567544: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-10-24 13:57:44.568127: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-10-24 13:57:44.568645: W tensorflow/core/common_runtime/gpu/gpu_bfc_allocator.cc:39] Overriding allow_growth setting because the TF_FORCE_GPU_ALLOW_GROWTH environment variable is set. Original config value was 0.
2021-10-24 13:57:44.568687: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1304] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 14652 MB memory) -> physical GPU (device: 0, name: Tesla P100-PCIE-16GB, pci bus id: 0000:00:04.0, compute capability: 6.0)
INFO:tensorflow:Restoring parameters from /content/gdrive/MyDrive/Training_SSD/SSD-1/checkpoints/ssd_300_vgg.ckpt/ssd_300_vgg.ckpt
I1024 13:57:45.783673 140239494461312 saver.py:1284] Restoring parameters from /content/gdrive/MyDrive/Training_SSD/SSD-1/checkpoints/ssd_300_vgg.ckpt/ssd_300_vgg.ckpt
INFO:tensorflow:Running local_init_op.
I1024 13:57:46.017776 140239494461312 session_manager.py:500] Running local_init_op.
INFO:tensorflow:Done running local_init_op.
I1024 13:57:46.075058 140239494461312 session_manager.py:502] Done running local_init_op.
INFO:tensorflow:Starting Session.
I1024 13:57:47.806292 140239494461312 learning.py:754] Starting Session.
INFO:tensorflow:Saving checkpoint to path /content/gdrive/MyDrive/Training_SSD/SSD-1/log_30000/model.ckpt
I1024 13:57:47.882676 140237141432064 supervisor.py:1117] Saving checkpoint to path /content/gdrive/MyDrive/Training_SSD/SSD-1/log_30000/model.ckpt
INFO:tensorflow:Starting Queues.
I1024 13:57:47.896139 140239494461312 learning.py:768] Starting Queues.
INFO:tensorflow:global_step/sec: 0
I1024 13:57:51.071433 140237149824768 supervisor.py:1099] global_step/sec: 0
2021-10-24 13:57:51.662253: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2021-10-24 13:57:52.787163: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
INFO:tensorflow:Recording summary at step 1.
I1024 13:57:55.259541 140235672987392 supervisor.py:1050] Recording summary at step 1.
INFO:tensorflow:global step 10: loss = 9.6467 (0.115 sec/step)
I1024 13:57:56.277639 140239494461312 learning.py:507] global step 10: loss = 9.6467 (0.115 sec/step)
INFO:tensorflow:global step 20: loss = 0.7245 (0.106 sec/step)
I1024 13:57:57.399851 140239494461312 learning.py:507] global step 20: loss = 0.7245 (0.106 sec/step)
INFO:tensorflow:global step 30: loss = 9.5159 (0.109 sec/step)
I1024 13:57:58.544558 140239494461312 learning.py:507] global step 30: loss = 9.5159 (0.109 sec/step)
INFO:tensorflow:global step 40: loss = 0.6637 (0.106 sec/step)
I1024 13:57:59.686780 140239494461312 learning.py:507] global step 40: loss = 0.6637 (0.106 sec/step)
INFO:tensorflow:global step 50: loss = 0.7424 (0.140 sec/step)
I1024 13:58:00.898716 140239494461312 learning.py:507] global step 50: loss = 0.7424 (0.140 sec/step)
INFO:tensorflow:global step 60: loss = 21.9683 (0.141 sec/step)
I1024 13:58:02.276094 140239494461312 learning.py:507] global step 60: loss = 21.9683 (0.141 sec/step)
INFO:tensorflow:global step 70: loss = 0.6486 (0.132 sec/step)
I1024 13:58:03.593588 140239494461312 learning.py:507] global step 70: loss = 0.6486 (0.132 sec/step)
INFO:tensorflow:global step 80: loss = 9.6484 (0.135 sec/step)
I1024 13:58:04.992696 140239494461312 learning.py:507] global step 80: loss = 9.6484 (0.135 sec/step)
INFO:tensorflow:global step 90: loss = 0.6877 (0.114 sec/step)
I1024 13:58:06.135541 140239494461312 learning.py:507] global step 90: loss = 0.6877 (0.114 sec/step)
INFO:tensorflow:global step 100: loss = 4.4349 (0.116 sec/step)
I1024 13:58:07.301742 140239494461312 learning.py:507] global step 100: loss = 4.4349 (0.116 sec/step)

When loss varies and doesn't converge to a desired minimum, there are a few situations you can check, for object detection.当损失变化并且没有收敛到所需的最小值时,您可以检查几种情况,用于对象检测。

I'm assuming you are fine-tuning to a specific dataset with your own custom objects/classes.我假设您正在使用自己的自定义对象/类对特定数据集进行微调。

  1. Make sure your bounding boxes are well fitted to the object.确保您的边界框非常适合对象。 From past experiences, this is a key point for a proper training.根据以往的经验,这是进行适当培训的关键点。 Spend as much time as you can building a robust dataset with good annotations.花尽可能多的时间来构建具有良好注释的强大数据集。
  2. What are the dimensions of your bounding boxes compared to the image?与图像相比,边界框的尺寸是多少? Too large or too small bounding boxes sometimes are difficult to detect, depending on the chosen hyperparameters, my workaround on this is to vary the aspect ratios and scales of the bounding box generator of the model.太大或太小的边界框有时难以检测,这取决于所选的超参数,我的解决方法是改变模型边界框生成器的纵横比和比例。
  3. Is your dataset similar across all frames?您的数据集在所有帧中是否相似? Do you have on average the same amount of bounding boxes in each frame?你平均每帧有相同数量的边界框吗? Increased variability in your dataset makes your model susceptible to weak training.数据集中可变性的增加使您的模型容易受到弱训练的影响。 So if this is your situation, you should consider enlarging your dataset with more data to decrease variability, or choose only similar frames by removing "outliers".因此,如果这是您的情况,您应该考虑使用更多数据扩大数据集以减少可变性,或者通过删除“异常值”仅选择相似的帧。
  4. Does your number of classes/objects correspond to the number classes configured in the model?您的类/对象数量是否与模型中配置的类数量相对应? If you only have 10 classes and the model is built for 100, this may lead to you situation.如果您只有 10 个类并且模型是为 100 个构建的,这可能会导致您的情况。
  5. Check your learning rate.检查你的学习率。 For pre-trained models, start of with a very small learning rate (1e-4 or 1e-5).对于预训练模型,从非常小的学习率(1e-4 或 1e-5)开始。 With a large learning rate your model may end just bouncing between the same local minima (unstable loss).如果学习率很大,您的模型可能会在相同的局部最小值(不稳定损失)之间反弹。 Try to implement a scheduler as well, reducing the learning rate every X steps.也尝试实现一个调度程序,每 X 步降低学习率。

Also take in consideration that your dataset is "small", which makes training much harder and sometimes requires a lot of hyperparameter tuning, nonetheless, you should be capable to achieve much higher mAP, mAR and decreased loss with some of the default config.还要考虑到你的数据集是“小”的,这使得训练更加困难,有时需要大量的超参数调整,尽管如此,你应该能够通过一些默认配置实现更高的 mAP、mAR 和减少损失。

Bonus tip, that repo is a bit old, consider taking a look at the Tensorflow Object Detection API here and here .额外提示,该 repo 有点旧,请考虑在此处此处查看 Tensorflow 对象检测 API。 It provides a wider range of models including SOTA ones, and the configuration is quite similar to what you are already doing.它提供了更广泛的模型,包括 SOTA 模型,并且配置与您已经在做的非常相似。

声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM