
TensorFlow-Slim multi-GPU training doesn't work

Currently I use TensorFlow-Slim to train a model from scratch. If I just follow the instructions here https://github.com/tensorflow/models/tree/master/slim#training-a-model-from-scratch , everything is OK.

However, I want to use multiple GPUs, so I set --num_clones=2 or --num_clones=4. Neither works: in both cases training gets stuck at global_step/sec: 0 and never makes progress.

DATASET_DIR=/tmp/imagenet
TRAIN_DIR=/tmp/train_logs
python train_image_classifier.py \
--num_clones=4 \
--train_dir=${TRAIN_DIR} \
--dataset_name=imagenet \
--dataset_split_name=train \
--dataset_dir=${DATASET_DIR} \
--model_name=inception_v3

Hope someone can help me, thanks in advance. By the way, I use TensorFlow 1.1 and Python 3.5 on Ubuntu 16.04. If you need more information, please let me know.

Your issue resembles an experience I had after switching from a single-GPU to a multi-GPU configuration with tf-slim. I observed that the parameter-server job was named 'localhost', which conflicted with the default job name that model_deploy assigns to my CPU device. I suggest you inspect the device names by following the "Logging device placement" section of the tensorflow.org GPU guide, which explains how to print each operation's device name to the console. You can then pass the actual job name to DeployConfig() via its ps_job_name parameter and proceed with training.
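A minimal sketch of the two steps above, assuming TensorFlow 1.x (as in the question) and the slim repository's deployment/model_deploy.py. Note that in that file the configuration class is named DeploymentConfig (the answer refers to it as DeployConfig); verify the names against your checkout before using this:

```python
# Sketch, not a definitive fix: assumes TensorFlow 1.x and the slim repo's
# deployment.model_deploy module on the import path.
import tensorflow as tf
from deployment import model_deploy

# Step 1: log device placement to see which job name the ops land on.
# Each op's assigned device (e.g. '/job:localhost/...' vs '/job:ps/...')
# is printed to the console when the session runs.
with tf.Session(config=tf.ConfigProto(log_device_placement=True)) as sess:
    sess.run(tf.constant(0.0))

# Step 2: if the log shows the parameter server running under a job named
# 'localhost' instead of the default 'ps', pass that name through.
deploy_config = model_deploy.DeploymentConfig(
    num_clones=4,               # one model clone per GPU
    ps_job_name='localhost')    # job name observed in the placement log
```

In train_image_classifier.py the DeploymentConfig is constructed near the top of main(), so the ps_job_name argument can be added to that existing call rather than creating a new config.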

