Reducing the training steps for SSD-300

Question

I am new to deep learning and I am trying to train my SSD-300 (single shot detector) model, which is taking too long. For example, even though I set 50 epochs, it is training for 108370+ global steps. I am using the default train_ssd_network.py file from the official GitHub repo: https://github.com/balancap/SSD-Tensorflow

The command I ran for training:

!python train_ssd_network.py --dataset_name=pascalvoc_2007 epochs= 50 --dataset_split_name=train --model_name=ssd_300_vgg --save_summaries_secs=60 --save_interval_secs=600 --weight_decay=0.0005 --optimizer=adam --learning_rate=0.001 --batch_size=6 --gpu_memory_fraction=0.9 --checkpoint_exclude_scopes=ssd_300_vgg/conv6,ssd_300_vgg/conv7,ssd_300_vgg/block8,ssd_300_vgg/block9,ssd_300_vgg/block10,ssd_300_vgg/block11,ssd_300_vgg/block4_box,ssd_300_vgg/block7_box,ssd_300_vgg/block8_box,ssd_300_vgg/block9_box,ssd_300_vgg/block10_box,ssd_300_vgg/block11_box

How can I change the number of training steps, and what is the ideal number of training steps?

The train_ssd_network.py script does not provide a specific number related to global_steps.

Answer 1

Since the script does not have a parameter for the value you want, you would have to go into the source code and find where the batch size and the number of steps are set for the training set. The values to use for the training batch size and training steps are determined by your model type and the size of your training data. For example, if you were classifying images of shape (64, 64, 3), you could probably set a fairly large batch size, say batch_size=100, without getting a resource exhausted error. If your image shape is (500, 500, 3), you need a much smaller batch size, say batch_size=20.

Usually in model.fit you do not need to specify the number of steps. Leave it as None and model.fit will calculate the steps internally. The same is true for model.predict. If you really need to calculate the steps, say for the test set, you want to go through the test set exactly once. For this to happen, batch_size × steps = number of samples in the test set. The code below calculates that for you. The value bmax is what you set as the maximum allowable batch_size based on the discussion above. The example below assumes there are 10,000 samples in the test set.

length = 10000  # number of samples in the test set
bmax = 50  # maximum batch size limit to avoid a resource exhausted error
# pick the largest divisor of length that does not exceed bmax
test_batch_size = sorted([length // n for n in range(1, length + 1) if length % n == 0 and length / n <= bmax], reverse=True)[0]
test_steps = length // test_batch_size
print('test batch size: ', test_batch_size, '  test steps: ', test_steps)

The result would be:

test batch size:  50   test steps:  200
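One caveat with the divisor search above: if the sample count has no divisor near bmax (for example, a prime count), the batch size collapses to 1. A minimal sketch of an alternative, assuming your data pipeline tolerates a smaller final batch, is to round the step count up instead. The function names here are hypothetical and only for illustration.

```python
import math

def largest_divisor_batch(num_samples, bmax):
    """Largest batch size <= bmax that divides num_samples exactly."""
    for b in range(min(bmax, num_samples), 0, -1):
        if num_samples % b == 0:
            return b

def steps_for_one_pass(num_samples, batch_size):
    """Batches needed to see every sample once, allowing a smaller last batch."""
    return math.ceil(num_samples / batch_size)

# 10000 divides evenly by 50; 10007 is prime, so its only divisor <= 50 is 1.
print(largest_divisor_batch(10000, 50), steps_for_one_pass(10000, 50))
print(largest_divisor_batch(10007, 50), steps_for_one_pass(10007, 50))
```

With the rounding-up approach, the 10007-sample case needs 201 steps at batch size 50 rather than 10007 steps at batch size 1.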

Answer 2

It looks like the module you are using supports a "max_number_of_steps" flag, which could be used like --max_number_of_steps=10000 as part of your command line statement. The module relies on TensorFlow flags to take input from the command line. You can see all the supported flags, with some descriptions, in the script's flag definitions.

I see in another answer that you found the relevant flag and changed the second argument, None, to another value. This second argument is the default value. Changing it should work, but is not necessary, since you could also pass that value in through the command line.

tf.app.flags.DEFINE_integer('max_number_of_steps', None,
                                'The maximum number of training steps.')
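If you think in epochs rather than steps, a rough value for the flag can be derived from the dataset size and batch size. This is only a sketch: it assumes each global step processes one batch of batch_size images, and num_train_samples is a placeholder you would replace with the real size of your training split. The helper name steps_for_epochs is hypothetical.

```python
import math

def steps_for_epochs(num_samples, batch_size, epochs):
    """Total training steps needed to cover the dataset `epochs` times."""
    steps_per_epoch = math.ceil(num_samples / batch_size)
    return steps_per_epoch * epochs

num_train_samples = 5000  # hypothetical; use the actual number of training images
max_steps = steps_for_epochs(num_train_samples, batch_size=6, epochs=50)
print("--max_number_of_steps={}".format(max_steps))
```

The printed string can then be appended to the training command in place of a hand-picked step count.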

The ideal number of training steps depends on your data and application. A common technique to see if you need to train for longer is to measure the model's loss over time during training and to stop training when the loss is no longer decreasing substantially.
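As a rough illustration of the "stop when the loss plateaus" idea, the check below compares the best recent loss against the best earlier loss. It is a generic sketch, not part of the SSD-Tensorflow script; the function name and the patience and min_delta thresholds are assumptions you would tune, and the loss values are made up for the example.

```python
def should_stop(losses, patience=5, min_delta=1e-3):
    """Return True when the last `patience` losses fail to beat the
    best earlier loss by at least `min_delta`."""
    if len(losses) <= patience:
        return False  # not enough history to judge yet
    best_before = min(losses[:-patience])
    best_recent = min(losses[-patience:])
    return best_before - best_recent < min_delta

# Made-up loss values, e.g. read from periodic training summaries.
history = [2.0, 1.5, 1.2, 1.0, 0.99, 0.995, 0.992, 0.991, 0.993]
print(should_stop(history, patience=3, min_delta=0.01))
```

In practice you would feed this the loss logged at regular step intervals and stop the run, or lower max_number_of_steps next time, once it returns True.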

Answer 1
1 2021-10-17 17:56:07

Answer 2
0 2021-10-19 01:49:19
