Multi-GPU training using tf.slim takes more time than single GPU

Question

I'm fine-tuning ResNet50 on the CIFAR10 dataset using tf.slim's train_image_classifier.py script:

python train_image_classifier.py \                    
  --train_dir=${TRAIN_DIR}/all \                                                        
  --dataset_name=cifar10 \                                                              
  --dataset_split_name=train \                                                          
  --dataset_dir=${DATASET_DIR} \                                                        
  --checkpoint_path=${TRAIN_DIR} \                                                      
  --model_name=resnet_v1_50 \                                                           
  --max_number_of_steps=3000 \                                                          
  --batch_size=32 \                                                                     
  --num_clones=4 \                                                                      
  --learning_rate=0.0001 \                                                              
  --save_interval_secs=10 \                                                             
  --save_summaries_secs=10 \                                                            
  --log_every_n_steps=10 \                                                                 
  --optimizer=sgd

For 3k steps, running this on a single GPU (Tesla M40) takes around 30mn, while running on 4 GPUs takes 50+ mn. (The accuracy is similar in both cases: ~75% and ~78%).

I know that one possible cause of delay in multi-GPU setups is loading the images, but in the case of tf.slim, it uses the CPU for that. Any ideas of what could be the issue? Thank you!

Answer 1

You will not get faster When set num_clones to use multi gpu. Because slim will train batch_size * num_clones data split in each of your GPU. After that calculate each loss by div num_clones and sum the total loss. ( https://github.com/tensorflow/models/blob/master/research/slim/deployment/model_deploy.py )
When CPU become the bottleneck, input pipeline cannot product so much data for train. Then you will get 4 times slowly when set num_clones=4.( https://www.tensorflow.org/performance/performance_guide )

Multi-GPU training using tf.slim takes more time than single GPU

Question

1 answers

solution1
1 2017-12-13 10:15:44

Multi-GPU training using tf.slim takes more time than single GPU

Question

1 answers

solution1 1 2017-12-13 10:15:44

solution1
1 2017-12-13 10:15:44