
Reduce Training steps for SSD-300

I am new to deep learning and I am trying to train my SSD-300 (Single Shot Detector) model, which is taking too long. For example, even though I specified 50 epochs, it has been training for 108370+ global steps. I am using the default train_ssd_network.py file from the official GitHub repo: https://github.com/balancap/SSD-Tensorflow

The command I ran for training:

!python train_ssd_network.py --dataset_name=pascalvoc_2007 epochs= 50 --dataset_split_name=train --model_name=ssd_300_vgg --save_summaries_secs=60 --save_interval_secs=600 --weight_decay=0.0005 --optimizer=adam --learning_rate=0.001 --batch_size=6 --gpu_memory_fraction=0.9 --checkpoint_exclude_scopes =ssd_300_vgg/conv6,ssd_300_vgg/conv7,ssd_300_vgg/block8,ssd_300_vgg/block9,ssd_300_vgg/block10,ssd_300_vgg/block11,ssd_300_vgg/block4_box,ssd_300_vgg/block7_box,ssd_300_vgg/block8_box,ssd_300_vgg/block9_box,ssd_300_vgg/block10_box,ssd_300_vgg/block11_box

How can I change the number of training steps, and what would be an ideal number of training steps?

The train_ssd_network.py file does not seem to provide a specific parameter for the number of global steps.

Since it does not have a parameter to set the value you want, you would have to go into the source code and find where the batch size and the number of steps are set for training.

The values you use for the training batch size and training steps are determined by your model type and the size of your training data. For example, if you were classifying images with shape (64, 64, 3), you could probably set a fairly large batch size without getting a resource-exhausted error, say batch_size=100. If your image shape is (500, 500, 3), then you need a much smaller batch size, say batch_size=20.

Usually in model.fit you do not need to specify the value of steps: leave it as None and model.fit will calculate the steps internally. The same is true for model.predict. If you really do need to calculate the steps, say because you want to go through the test set exactly once, then batch_size × steps must equal the number of samples in the test set. The code below will calculate that for you. The value bmax is the maximum batch size you allow, based on the discussion above. The example below assumes there are 10,000 samples in the test set.

length = 10000  # number of samples in the test set
bmax = 50       # maximum batch size that avoids a resource-exhausted error

# Pick the largest divisor of `length` that does not exceed bmax, so that
# test_batch_size * test_steps == length exactly.
test_batch_size = sorted([length // n for n in range(1, length + 1)
                          if length % n == 0 and length // n <= bmax],
                         reverse=True)[0]
test_steps = length // test_batch_size
print('test batch size: ', test_batch_size, '  test steps: ', test_steps)

The result would be:

test batch size:  50   test steps:  200
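
To illustrate the model.fit / model.predict point above, here is a minimal Keras-style sketch; the tiny model and the random arrays are just stand-ins, not part of the original answer:

import numpy as np
import tensorflow as tf

# Stand-in data: small random "images" instead of a real dataset.
x_train = np.random.rand(1000, 32, 32, 3).astype('float32')
y_train = np.random.randint(0, 10, size=(1000,))
x_test = np.random.rand(10000, 32, 32, 3).astype('float32')

model = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=(32, 32, 3)),
    tf.keras.layers.Dense(10, activation='softmax'),
])
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy')

# steps_per_epoch is left as None, so Keras works out the number of batches itself.
model.fit(x_train, y_train, batch_size=100, epochs=1)

# One exact pass over the 10,000 test samples: 50 * 200 = 10,000.
preds = model.predict(x_test, batch_size=50)
print(preds.shape)  # (10000, 10)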

It looks like the module you are using supports a "max_number_of_steps" flag, which could be used as --max_number_of_steps=10000 in your command-line call. The script relies on TensorFlow flags to take input from the command line; you can see all the supported flags, with short descriptions, in the flag definitions at the top of train_ssd_network.py.
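
For example (the step count below is only an illustration), you could add the flag to your existing command and keep the rest of the arguments unchanged:

!python train_ssd_network.py --dataset_name=pascalvoc_2007 --dataset_split_name=train --model_name=ssd_300_vgg --max_number_of_steps=10000 ...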

I see in another answer that you found the relevant flag and changed the second argument, None, to another value. This second argument is the default value. Changing it should work, but is not necessary, since you could also pass that value in through the command line.

tf.app.flags.DEFINE_integer('max_number_of_steps', None,
                                'The maximum number of training steps.')
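
For context, this is roughly how a TF 1.x flag travels from the command line into the program. The sketch below is generic and not the repo's exact code; only the flag name and its default are taken from the snippet above:

import tensorflow as tf

tf.app.flags.DEFINE_integer('max_number_of_steps', None,
                            'The maximum number of training steps.')
FLAGS = tf.app.flags.FLAGS

def main(_):
    # Whatever you pass as --max_number_of_steps on the command line shows up here;
    # if the flag is omitted, the default (None) is used and training is not capped.
    print('max_number_of_steps =', FLAGS.max_number_of_steps)
    # Slim-based training scripts typically forward this value to their training loop,
    # e.g. slim.learning.train(..., number_of_steps=FLAGS.max_number_of_steps).

if __name__ == '__main__':
    tf.app.run()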

The ideal number of training steps depends on your data and application. A common technique for deciding whether you need to train for longer is to measure the model's loss over time during training and to stop training when the loss is no longer decreasing substantially.
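
If you want to turn that rule of thumb into code, a simple pattern is to track the best loss seen so far and stop once it has not improved for a while. Below is a minimal, self-contained sketch; train_one_step and all of the numbers are hypothetical placeholders, not part of SSD-Tensorflow:

import math
import random

def train_one_step(step):
    # Placeholder "loss curve": improves quickly at first, then flattens out.
    return 0.5 + 2.0 * math.exp(-step / 300.0) + random.uniform(-0.005, 0.005)

max_steps = 50000    # hard upper bound on training steps
patience = 1000      # steps to wait without a meaningful improvement
min_delta = 1e-3     # smallest loss decrease that still counts as improvement

best_loss = float('inf')
steps_since_improvement = 0

for step in range(max_steps):
    loss = train_one_step(step)
    if best_loss - loss > min_delta:   # loss improved enough
        best_loss = loss
        steps_since_improvement = 0
    else:
        steps_since_improvement += 1
    if steps_since_improvement >= patience:
        print('Loss stopped improving around step', step, '- a reasonable point to stop.')
        break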

