
TensorFlow-Slim multi-GPU training doesn't work

Currently I use TensorFlow-Slim to train a model from scratch. If I just follow the instructions here https://github.com/tensorflow/models/tree/master/slim#training-a-model-from-scratch , everything is OK.

However, I want to use multiple GPUs, so I set --num_clones=2 or --num_clones=4. Neither works: in both cases training gets stuck at global_step/sec: 0 and never makes progress.

DATASET_DIR=/tmp/imagenet
TRAIN_DIR=/tmp/train_logs
python train_image_classifier.py \
--num_clones=4 \
--train_dir=${TRAIN_DIR} \
--dataset_name=imagenet \
--dataset_split_name=train \
--dataset_dir=${DATASET_DIR} \
--model_name=inception_v3

Hope someone can help me, thanks in advance. By the way, I use TensorFlow 1.1 and Python 3.5 on Ubuntu 16.04. If you need more information, please let me know.

Your issue resembles an experience I had after switching from a single-GPU to a multi-GPU configuration with tf-slim. I observed that the parameter-server job was named 'localhost', which conflicted with the default job name that model_deploy assigns to my CPU device. I suggest you inspect the device names by following the "Logging device placement" section of the tensorflow.org GPU guide, which explains how to print each operation's device name to the console. You can then pass the actual job name to DeployConfig() via its ps_job_name parameter and proceed with training.
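A minimal sketch of the two steps above, assuming TensorFlow 1.x (as in the question) and the slim repository's deployment/model_deploy.py. Note that in that file the configuration class is named DeploymentConfig (the answer refers to it as DeployConfig); verify the names against your checkout before using this:

```python
# Sketch, not a definitive fix: assumes TensorFlow 1.x and the slim repo's
# deployment.model_deploy module on the import path.
import tensorflow as tf
from deployment import model_deploy

# Step 1: log device placement to see which job name the ops land on.
# Each op's assigned device (e.g. '/job:localhost/...' vs '/job:ps/...')
# is printed to the console when the session runs.
with tf.Session(config=tf.ConfigProto(log_device_placement=True)) as sess:
    sess.run(tf.constant(0.0))

# Step 2: if the log shows the parameter server running under a job named
# 'localhost' instead of the default 'ps', pass that name through.
deploy_config = model_deploy.DeploymentConfig(
    num_clones=4,               # one model clone per GPU
    ps_job_name='localhost')    # job name observed in the placement log
```

In train_image_classifier.py the DeploymentConfig is constructed near the top of main(), so the ps_job_name argument can be added to that existing call rather than creating a new config.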

