簡體   English   中英

由於mAP低,SSD-300不穩定損失

[英]Unstable loss SSD-300 due to which mAP is low

我正在訓練我的 SSD-300 模型,我已將圖像大小調整為 300x300。 我正在使用 github repo 中提到的默認設置: https : //github.com/balancap/SSD-Tensorflow

訓練時損失不穩定。 我嘗試將其訓練到 50,000 個訓練步驟。 我得到的當前 mAP 是 0.26(VOC 2007)和 0.24(VOC 2012)

訓練集:1500 張圖片測試:300 張圖片

當前參數:

!python train_ssd_network.py --dataset_name=pascalvoc_2007 --dataset_split_name=train --model_name=ssd_300_vgg --save_summaries_secs=60 --save_interval_secs=600 --weight_decay=0.00004 --optimizer=adam --learning_rate=0.01 --batch_size=2 --gpu_memory_fraction=0.9 --learning_rate_decay_factor=0.94 -num_classes=3  --checkpoint_exclude_scopes =ssd_300_vgg/conv6,ssd_300_vgg/conv7,ssd_300_vgg/block8,ssd_300_vgg/block9,ssd_300_vgg/block10,ssd_300_vgg/block11,ssd_300_vgg/block4_box,ssd_300_vgg/block7_box,ssd_300_vgg/block8_box,ssd_300_vgg/block9_box,ssd_300_vgg/block10_box,ssd_300_vgg/block11_box --eval_training_data=True

我該怎么做才能獲得良好的准確度 (mAP)?

損失的例子,損失甚至達到了80:

W1024 13:57:41.660651 140239494461312 deprecation.py:323] From train_ssd_network.py:256: batch (from tensorflow.python.training.input) is deprecated and will be removed in a future version.
Instructions for updating:
Queue-based input pipelines have been replaced by `tf.data`. Use `tf.data.Dataset.batch(batch_size)` (or `padded_batch(...)` if `dynamic_pad=True`).
WARNING:tensorflow:From train_ssd_network.py:292: The name tf.get_collection is deprecated. Please use tf.compat.v1.get_collection instead.

W1024 13:57:41.676577 140239494461312 module_wrapper.py:139] From train_ssd_network.py:292: The name tf.get_collection is deprecated. Please use tf.compat.v1.get_collection instead.

WARNING:tensorflow:From train_ssd_network.py:292: The name tf.GraphKeys is deprecated. Please use tf.compat.v1.GraphKeys instead.

W1024 13:57:41.676797 140239494461312 module_wrapper.py:139] From train_ssd_network.py:292: The name tf.GraphKeys is deprecated. Please use tf.compat.v1.GraphKeys instead.

WARNING:tensorflow:From /content/gdrive/MyDrive/Training_SSD/SSD-1/deployment/model_deploy.py:194: The name tf.variable_scope is deprecated. Please use tf.compat.v1.variable_scope instead.

W1024 13:57:41.677163 140239494461312 module_wrapper.py:139] From /content/gdrive/MyDrive/Training_SSD/SSD-1/deployment/model_deploy.py:194: The name tf.variable_scope is deprecated. Please use tf.compat.v1.variable_scope instead.

WARNING:tensorflow:From /content/gdrive/MyDrive/Training_SSD/SSD-1/deployment/model_deploy.py:194: The name tf.get_variable_scope is deprecated. Please use tf.compat.v1.get_variable_scope instead.

W1024 13:57:41.677324 140239494461312 module_wrapper.py:139] From /content/gdrive/MyDrive/Training_SSD/SSD-1/deployment/model_deploy.py:194: The name tf.get_variable_scope is deprecated. Please use tf.compat.v1.get_variable_scope instead.

WARNING:tensorflow:From /usr/local/lib/python3.7/dist-packages/tensorflow_core/contrib/layers/python/layers/layers.py:1057: Layer.apply (from tensorflow.python.keras.engine.base_layer) is deprecated and will be removed in a future version.
Instructions for updating:
Please use `layer.__call__` method instead.
W1024 13:57:41.679852 140239494461312 deprecation.py:323] From /usr/local/lib/python3.7/dist-packages/tensorflow_core/contrib/layers/python/layers/layers.py:1057: Layer.apply (from tensorflow.python.keras.engine.base_layer) is deprecated and will be removed in a future version.
Instructions for updating:
Please use `layer.__call__` method instead.
WARNING:tensorflow:From /content/gdrive/MyDrive/Training_SSD/SSD-1/nets/ssd_vgg_300.py:476: dropout (from tensorflow.python.layers.core) is deprecated and will be removed in a future version.
Instructions for updating:
Use keras.layers.dropout instead.
W1024 13:57:41.998192 140239494461312 deprecation.py:323] From /content/gdrive/MyDrive/Training_SSD/SSD-1/nets/ssd_vgg_300.py:476: dropout (from tensorflow.python.layers.core) is deprecated and will be removed in a future version.
Instructions for updating:
Use keras.layers.dropout instead.
WARNING:tensorflow:From /content/gdrive/MyDrive/Training_SSD/SSD-1/nets/ssd_vgg_300.py:642: The name tf.losses.add_loss is deprecated. Please use tf.compat.v1.losses.add_loss instead.

W1024 13:57:42.408573 140239494461312 module_wrapper.py:139] From /content/gdrive/MyDrive/Training_SSD/SSD-1/nets/ssd_vgg_300.py:642: The name tf.losses.add_loss is deprecated. Please use tf.compat.v1.losses.add_loss instead.

WARNING:tensorflow:From train_ssd_network.py:307: The name tf.summary.histogram is deprecated. Please use tf.compat.v1.summary.histogram instead.

W1024 13:57:42.419716 140239494461312 module_wrapper.py:139] From train_ssd_network.py:307: The name tf.summary.histogram is deprecated. Please use tf.compat.v1.summary.histogram instead.

WARNING:tensorflow:From train_ssd_network.py:308: The name tf.summary.scalar is deprecated. Please use tf.compat.v1.summary.scalar instead.

W1024 13:57:42.420833 140239494461312 module_wrapper.py:139] From train_ssd_network.py:308: The name tf.summary.scalar is deprecated. Please use tf.compat.v1.summary.scalar instead.

WARNING:tensorflow:From /content/gdrive/MyDrive/Training_SSD/SSD-1/tf_utils.py:105: The name tf.train.exponential_decay is deprecated. Please use tf.compat.v1.train.exponential_decay instead.

W1024 13:57:42.625701 140239494461312 module_wrapper.py:139] From /content/gdrive/MyDrive/Training_SSD/SSD-1/tf_utils.py:105: The name tf.train.exponential_decay is deprecated. Please use tf.compat.v1.train.exponential_decay instead.

WARNING:tensorflow:From /content/gdrive/MyDrive/Training_SSD/SSD-1/tf_utils.py:144: The name tf.train.AdamOptimizer is deprecated. Please use tf.compat.v1.train.AdamOptimizer instead.

W1024 13:57:42.629828 140239494461312 module_wrapper.py:139] From /content/gdrive/MyDrive/Training_SSD/SSD-1/tf_utils.py:144: The name tf.train.AdamOptimizer is deprecated. Please use tf.compat.v1.train.AdamOptimizer instead.

WARNING:tensorflow:From /content/gdrive/MyDrive/Training_SSD/SSD-1/tf_utils.py:245: The name tf.trainable_variables is deprecated. Please use tf.compat.v1.trainable_variables instead.

W1024 13:57:42.630920 140239494461312 module_wrapper.py:139] From /content/gdrive/MyDrive/Training_SSD/SSD-1/tf_utils.py:245: The name tf.trainable_variables is deprecated. Please use tf.compat.v1.trainable_variables instead.

WARNING:tensorflow:From train_ssd_network.py:367: The name tf.summary.merge is deprecated. Please use tf.compat.v1.summary.merge instead.

W1024 13:57:43.817304 140239494461312 module_wrapper.py:139] From train_ssd_network.py:367: The name tf.summary.merge is deprecated. Please use tf.compat.v1.summary.merge instead.

WARNING:tensorflow:From train_ssd_network.py:372: The name tf.GPUOptions is deprecated. Please use tf.compat.v1.GPUOptions instead.

W1024 13:57:43.820022 140239494461312 module_wrapper.py:139] From train_ssd_network.py:372: The name tf.GPUOptions is deprecated. Please use tf.compat.v1.GPUOptions instead.

WARNING:tensorflow:From train_ssd_network.py:373: The name tf.ConfigProto is deprecated. Please use tf.compat.v1.ConfigProto instead.

W1024 13:57:43.820249 140239494461312 module_wrapper.py:139] From train_ssd_network.py:373: The name tf.ConfigProto is deprecated. Please use tf.compat.v1.ConfigProto instead.

WARNING:tensorflow:From train_ssd_network.py:375: The name tf.train.Saver is deprecated. Please use tf.compat.v1.train.Saver instead.

W1024 13:57:43.820408 140239494461312 module_wrapper.py:139] From train_ssd_network.py:375: The name tf.train.Saver is deprecated. Please use tf.compat.v1.train.Saver instead.

WARNING:tensorflow:From /content/gdrive/MyDrive/Training_SSD/SSD-1/tf_utils.py:226: The name tf.gfile.IsDirectory is deprecated. Please use tf.io.gfile.isdir instead.

W1024 13:57:43.963253 140239494461312 module_wrapper.py:139] From /content/gdrive/MyDrive/Training_SSD/SSD-1/tf_utils.py:226: The name tf.gfile.IsDirectory is deprecated. Please use tf.io.gfile.isdir instead.

WARNING:tensorflow:From /content/gdrive/MyDrive/Training_SSD/SSD-1/tf_utils.py:230: The name tf.logging.info is deprecated. Please use tf.compat.v1.logging.info instead.

W1024 13:57:43.963784 140239494461312 module_wrapper.py:139] From /content/gdrive/MyDrive/Training_SSD/SSD-1/tf_utils.py:230: The name tf.logging.info is deprecated. Please use tf.compat.v1.logging.info instead.

INFO:tensorflow:Fine-tuning from /content/gdrive/MyDrive/Training_SSD/SSD-1/checkpoints/ssd_300_vgg.ckpt/ssd_300_vgg.ckpt. Ignoring missing vars: False
I1024 13:57:43.963922 140239494461312 tf_utils.py:230] Fine-tuning from /content/gdrive/MyDrive/Training_SSD/SSD-1/checkpoints/ssd_300_vgg.ckpt/ssd_300_vgg.ckpt. Ignoring missing vars: False
WARNING:tensorflow:From /usr/local/lib/python3.7/dist-packages/tensorflow_core/contrib/slim/python/slim/learning.py:742: Supervisor.__init__ (from tensorflow.python.training.supervisor) is deprecated and will be removed in a future version.
Instructions for updating:
Please switch to tf.train.MonitoredTrainingSession
W1024 13:57:44.120857 140239494461312 deprecation.py:323] From /usr/local/lib/python3.7/dist-packages/tensorflow_core/contrib/slim/python/slim/learning.py:742: Supervisor.__init__ (from tensorflow.python.training.supervisor) is deprecated and will be removed in a future version.
Instructions for updating:
Please switch to tf.train.MonitoredTrainingSession
2021-10-24 13:57:44.436826: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 AVX512F FMA
2021-10-24 13:57:44.440876: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2000170000 Hz
2021-10-24 13:57:44.441070: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x56449f9cb9c0 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2021-10-24 13:57:44.441100: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Host, Default Version
2021-10-24 13:57:44.442817: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcuda.so.1
2021-10-24 13:57:44.554870: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-10-24 13:57:44.555802: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x56449f9cb640 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
2021-10-24 13:57:44.555833: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Tesla P100-PCIE-16GB, Compute Capability 6.0
2021-10-24 13:57:44.556006: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-10-24 13:57:44.556564: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 0 with properties: 
name: Tesla P100-PCIE-16GB major: 6 minor: 0 memoryClockRate(GHz): 1.3285
pciBusID: 0000:00:04.0
2021-10-24 13:57:44.556867: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
2021-10-24 13:57:44.558049: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
2021-10-24 13:57:44.559113: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10.0
2021-10-24 13:57:44.559464: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10.0
2021-10-24 13:57:44.560805: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10.0
2021-10-24 13:57:44.561773: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10.0
2021-10-24 13:57:44.564919: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2021-10-24 13:57:44.565038: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-10-24 13:57:44.565658: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-10-24 13:57:44.566169: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1746] Adding visible gpu devices: 0
2021-10-24 13:57:44.566234: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
2021-10-24 13:57:44.567361: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1159] Device interconnect StreamExecutor with strength 1 edge matrix:
2021-10-24 13:57:44.567389: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1165]      0 
2021-10-24 13:57:44.567399: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1178] 0:   N 
2021-10-24 13:57:44.567544: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-10-24 13:57:44.568127: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-10-24 13:57:44.568645: W tensorflow/core/common_runtime/gpu/gpu_bfc_allocator.cc:39] Overriding allow_growth setting because the TF_FORCE_GPU_ALLOW_GROWTH environment variable is set. Original config value was 0.
2021-10-24 13:57:44.568687: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1304] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 14652 MB memory) -> physical GPU (device: 0, name: Tesla P100-PCIE-16GB, pci bus id: 0000:00:04.0, compute capability: 6.0)
INFO:tensorflow:Restoring parameters from /content/gdrive/MyDrive/Training_SSD/SSD-1/checkpoints/ssd_300_vgg.ckpt/ssd_300_vgg.ckpt
I1024 13:57:45.783673 140239494461312 saver.py:1284] Restoring parameters from /content/gdrive/MyDrive/Training_SSD/SSD-1/checkpoints/ssd_300_vgg.ckpt/ssd_300_vgg.ckpt
INFO:tensorflow:Running local_init_op.
I1024 13:57:46.017776 140239494461312 session_manager.py:500] Running local_init_op.
INFO:tensorflow:Done running local_init_op.
I1024 13:57:46.075058 140239494461312 session_manager.py:502] Done running local_init_op.
INFO:tensorflow:Starting Session.
I1024 13:57:47.806292 140239494461312 learning.py:754] Starting Session.
INFO:tensorflow:Saving checkpoint to path /content/gdrive/MyDrive/Training_SSD/SSD-1/log_30000/model.ckpt
I1024 13:57:47.882676 140237141432064 supervisor.py:1117] Saving checkpoint to path /content/gdrive/MyDrive/Training_SSD/SSD-1/log_30000/model.ckpt
INFO:tensorflow:Starting Queues.
I1024 13:57:47.896139 140239494461312 learning.py:768] Starting Queues.
INFO:tensorflow:global_step/sec: 0
I1024 13:57:51.071433 140237149824768 supervisor.py:1099] global_step/sec: 0
2021-10-24 13:57:51.662253: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2021-10-24 13:57:52.787163: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
INFO:tensorflow:Recording summary at step 1.
I1024 13:57:55.259541 140235672987392 supervisor.py:1050] Recording summary at step 1.
INFO:tensorflow:global step 10: loss = 9.6467 (0.115 sec/step)
I1024 13:57:56.277639 140239494461312 learning.py:507] global step 10: loss = 9.6467 (0.115 sec/step)
INFO:tensorflow:global step 20: loss = 0.7245 (0.106 sec/step)
I1024 13:57:57.399851 140239494461312 learning.py:507] global step 20: loss = 0.7245 (0.106 sec/step)
INFO:tensorflow:global step 30: loss = 9.5159 (0.109 sec/step)
I1024 13:57:58.544558 140239494461312 learning.py:507] global step 30: loss = 9.5159 (0.109 sec/step)
INFO:tensorflow:global step 40: loss = 0.6637 (0.106 sec/step)
I1024 13:57:59.686780 140239494461312 learning.py:507] global step 40: loss = 0.6637 (0.106 sec/step)
INFO:tensorflow:global step 50: loss = 0.7424 (0.140 sec/step)
I1024 13:58:00.898716 140239494461312 learning.py:507] global step 50: loss = 0.7424 (0.140 sec/step)
INFO:tensorflow:global step 60: loss = 21.9683 (0.141 sec/step)
I1024 13:58:02.276094 140239494461312 learning.py:507] global step 60: loss = 21.9683 (0.141 sec/step)
INFO:tensorflow:global step 70: loss = 0.6486 (0.132 sec/step)
I1024 13:58:03.593588 140239494461312 learning.py:507] global step 70: loss = 0.6486 (0.132 sec/step)
INFO:tensorflow:global step 80: loss = 9.6484 (0.135 sec/step)
I1024 13:58:04.992696 140239494461312 learning.py:507] global step 80: loss = 9.6484 (0.135 sec/step)
INFO:tensorflow:global step 90: loss = 0.6877 (0.114 sec/step)
I1024 13:58:06.135541 140239494461312 learning.py:507] global step 90: loss = 0.6877 (0.114 sec/step)
INFO:tensorflow:global step 100: loss = 4.4349 (0.116 sec/step)
I1024 13:58:07.301742 140239494461312 learning.py:507] global step 100: loss = 4.4349 (0.116 sec/step)

當損失變化並且沒有收斂到所需的最小值時,您可以檢查幾種情況,用於對象檢測。

我假設您正在使用自己的自定義對象/類對特定數據集進行微調。

  1. 確保您的邊界框非常適合對象。 根據以往的經驗,這是進行適當培訓的關鍵點。 花盡可能多的時間來構建具有良好注釋的強大數據集。
  2. 與圖像相比,邊界框的尺寸是多少? 太大或太小的邊界框有時難以檢測,這取決於所選的超參數,我的解決方法是改變模型邊界框生成器的縱橫比和比例。
  3. 您的數據集在所有幀中是否相似? 你平均每幀有相同數量的邊界框嗎? 數據集中可變性的增加使您的模型容易受到弱訓練的影響。 因此,如果這是您的情況,您應該考慮使用更多數據擴大數據集以減少可變性,或者通過刪除“異常值”僅選擇相似的幀。
  4. 您的類/對象數量是否與模型中配置的類數量相對應? 如果您只有 10 個類並且模型是為 100 個構建的,這可能會導致您的情況。
  5. 檢查你的學習率。 對於預訓練模型,從非常小的學習率(1e-4 或 1e-5)開始。 如果學習率很大,您的模型可能會在相同的局部最小值(不穩定損失)之間反彈。 也嘗試實現一個調度程序,每 X 步降低學習率。

還要考慮到你的數據集是“小”的,這使得訓練更加困難,有時需要大量的超參數調整,盡管如此,你應該能夠通過一些默認配置實現更高的 mAP、mAR 和減少損失。

額外提示,該 repo 有點舊,請考慮在此處此處查看 Tensorflow 對象檢測 API。 它提供了更廣泛的模型,包括 SOTA 模型,並且配置與您已經在做的非常相似。

暫無
暫無

聲明:本站的技術帖子網頁,遵循CC BY-SA 4.0協議,如果您需要轉載,請注明本站網址或者原文地址。任何問題請咨詢:yoyou2525@163.com.

 
粵ICP備18138465號  © 2020-2024 STACKOOM.COM