
tensorflow slim multi-GPU training doesn't work

I am currently using tensorflow slim to train a model from scratch. If I just follow the instructions at https://github.com/tensorflow/models/tree/master/slim#training-a-model-from-scratch, everything works.

However, I want to use multiple GPUs, so I set --num_clones=2 or --num_clones=4, and neither works. In both cases training gets stuck at global_step/sec: 0 and never continues. (The original post linked a screenshot of this output.)

DATASET_DIR=/tmp/imagenet
TRAIN_DIR=/tmp/train_logs
python train_image_classifier.py \
--num_clones=4 \
--train_dir=${TRAIN_DIR} \
--dataset_name=imagenet \
--dataset_split_name=train \
--dataset_dir=${DATASET_DIR} \
--model_name=inception_v3

I hope someone can help; thanks in advance. By the way, I use tensorflow 1.1 and Python 3.5 on Ubuntu 16.04. If you need more information, please let me know.

Your issue resembles one I ran into after switching from a single-GPU to a multi-GPU configuration with tf-slim. I observed that the parameter server job took the name 'localhost', which conflicted with the default job name that model_deploy assigned to my CPU device. I suggest you inspect the device names by following the "Logging Device placement" section of the tensorflow.org documentation. It explains how to print device names to the console on a per-operation basis. You can then pass the actual job name to the ps_job_name parameter of DeployConfig() and proceed with training.
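As a rough illustration of the inspection step, the sketch below pulls the job name out of device placement log lines. It uses only the standard library; the helper name job_names and the sample log lines are made up for this example and are not taken from the question. It assumes the logged device strings contain a "/job:<name>" component, as in "/job:localhost/replica:0/task:0/device:GPU:0".

```python
import re

# Matches the job component of a TensorFlow device string such as
# "/job:localhost/replica:0/task:0/device:GPU:0".
_JOB_RE = re.compile(r"/job:([^/\s]+)")


def job_names(log_lines):
    """Return the distinct job names found in the log lines, in order of first appearance."""
    seen = []
    for line in log_lines:
        for name in _JOB_RE.findall(line):
            if name not in seen:
                seen.append(name)
    return seen


# Illustrative log lines (hypothetical, written for this sketch).
sample_log = [
    "MatMul: (MatMul): /job:localhost/replica:0/task:0/device:GPU:0",
    "a: (Const): /job:localhost/replica:0/task:0/device:GPU:0",
]

print(job_names(sample_log))
```

If the name printed here differs from the 'ps' default of ps_job_name, one option is to pass it explicitly, for example by adding ps_job_name='localhost' to the model_deploy.DeployConfig(...) call in train_image_classifier.py. Whether that resolves the hang on your machine is an assumption to verify, not a guarantee.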
