Tensorflow Cloud ML Object Detection - Errors with Distributed Training
I am trying to follow TensorFlow's object detection tutorial for distributed training to develop my own model, using code identical to what is in the repository.
I made a few changes from the tutorial, most notably using runtime version 1.5 instead of the 1.2 the tutorial calls for. When I try to run on Google Cloud ML there are no obvious errors (that I can see), but the job exits quickly without doing any training.
Here is the command I use to start the training job:
gcloud ml-engine jobs submit training object_detection_`date +%s` \
    --job-dir=gs://test-bucket/training/ \
    --packages dist/object_detection-0.1.tar.gz,slim/dist/slim-0.1.tar.gz \
    --module-name object_detection.train \
    --region us-central1 \
    --config ./config.yaml \
    -- \
    --train_dir=gs://test-bucket/data/ \
    --pipeline_config_path=gs://test-bucket/configs/ssd_inception_v2_coco.config
And here is my config.yaml:
trainingInput:
  runtimeVersion: "1.5"
  scaleTier: CUSTOM
  masterType: complex_model_l
  workerCount: 9
  workerType: standard_gpu
  parameterServerCount: 3
  parameterServerType: large_model
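For context on how this config maps to the cluster each replica joins: Cloud ML Engine hands every replica its role and the peer addresses through the `TF_CONFIG` environment variable, and a worker's `CreateSession` blocks until the master replica is reachable. A minimal sketch of what a cluster spec matching the config above could look like (the host names here are invented for illustration):

```python
import json
import os

# Illustrative cluster spec matching config.yaml: 1 master, 9 workers,
# 3 parameter servers. Host names are made up; Cloud ML Engine sets the
# real TF_CONFIG for each replica.
os.environ["TF_CONFIG"] = json.dumps({
    "cluster": {
        "master": ["master-0:2222"],
        "worker": ["worker-%d:2222" % i for i in range(9)],
        "ps": ["ps-%d:2222" % i for i in range(3)],
    },
    "task": {"type": "worker", "index": 1},  # this replica is worker-replica-1
})

cfg = json.loads(os.environ["TF_CONFIG"])
# A worker's CreateSession waits on this address; that is the
# "/job:master/replica:0/task:0" the log lines below keep waiting for.
master_addr = cfg["cluster"]["master"][0]
print("replica role:", cfg["task"]["type"], cfg["task"]["index"])
print("waiting on master at:", master_addr)
```

If the master dies early (as the `Master init: Unavailable: Stream removed` error suggests), every worker sits in this waiting state until the service tears the job down with SIGTERM, which matches the log below.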
And finally, here is the end of my job log:
I worker-replica-6 Clean up finished. worker-replica-6
I worker-replica-7 Signal 15 (SIGTERM) was caught. Terminated by service. This is normal behavior. worker-replica-7
I worker-replica-7 Module completed; cleaning up. worker-replica-7
I worker-replica-7 Clean up finished. worker-replica-7
I worker-replica-8 Signal 15 (SIGTERM) was caught. Terminated by service. This is normal behavior. worker-replica-8
I worker-replica-8 Module completed; cleaning up. worker-replica-8
I worker-replica-8 Clean up finished. worker-replica-8
I worker-replica-1 CreateSession still waiting for response from worker: /job:master/replica:0/task:0 worker-replica-1
I worker-replica-1 Signal 15 (SIGTERM) was caught. Terminated by service. This is normal behavior. worker-replica-1
I worker-replica-1 Module completed; cleaning up. worker-replica-1
I worker-replica-1 Clean up finished. worker-replica-1
I worker-replica-7 CreateSession still waiting for response from worker: /job:master/replica:0/task:0 worker-replica-7
I worker-replica-8 CreateSession still waiting for response from worker: /job:master/replica:0/task:0 worker-replica-8
I worker-replica-6 CreateSession still waiting for response from worker: /job:master/replica:0/task:0 worker-replica-6
I worker-replica-3 CreateSession still waiting for response from worker: /job:master/replica:0/task:0 worker-replica-3
I worker-replica-0 CreateSession still waiting for response from worker: /job:master/replica:0/task:0 worker-replica-0
I worker-replica-2 CreateSession still waiting for response from worker: /job:master/replica:0/task:0 worker-replica-2
I worker-replica-5 CreateSession still waiting for response from worker: /job:master/replica:0/task:0 worker-replica-5
I worker-replica-1 CreateSession still waiting for response from worker: /job:master/replica:0/task:0 worker-replica-1
I worker-replica-7 CreateSession still waiting for response from worker: /job:master/replica:0/task:0 worker-replica-7
I worker-replica-8 CreateSession still waiting for response from worker: /job:master/replica:0/task:0 worker-replica-8
I worker-replica-6 CreateSession still waiting for response from worker: /job:master/replica:0/task:0 worker-replica-6
I worker-replica-3 CreateSession still waiting for response from worker: /job:master/replica:0/task:0 worker-replica-3
I worker-replica-0 CreateSession still waiting for response from worker: /job:master/replica:0/task:0 worker-replica-0
I worker-replica-2 CreateSession still waiting for response from worker: /job:master/replica:0/task:0 worker-replica-2
I worker-replica-5 CreateSession still waiting for response from worker: /job:master/replica:0/task:0 worker-replica-5
I worker-replica-1 CreateSession still waiting for response from worker: /job:master/replica:0/task:0 worker-replica-1
I worker-replica-7 CreateSession still waiting for response from worker: /job:master/replica:0/task:0 worker-replica-7
I worker-replica-8 CreateSession still waiting for response from worker: /job:master/replica:0/task:0 worker-replica-8
I worker-replica-6 CreateSession still waiting for response from worker: /job:master/replica:0/task:0 worker-replica-6
I Finished tearing down TensorFlow.
I Job failed.
As I mentioned, I can't pull anything useful out of the logs. Further up I get this error: Master init: Unavailable: Stream removed
But I'm not sure what to do with it. Thanks for any nudge in the right direction!
I reproduced your problem. I fixed it by following these steps:
roysheffi commented on this issue 3 months ago: Hi @pkulzc, I think I may have a lead:
At line 357, object_detection/trainer.py calls tf.contrib.slim.learning.train(), which uses the deprecated tf.train.Supervisor and should be migrated to tf.train.MonitoredTrainingSession, as described in the tf.train.Supervisor documentation.
This was requested in tensorflow/tensorflow#15793 and, in the last comment of yahoo/TensorFlowOnSpark#245, reported as the fix for tensorflow/tensorflow#17852. [1]
So, in the end, I made this change in trainer.py: call
tf.train.MonitoredTrainingSession(
instead of
slim.learning.train(
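The change above can be sketched as follows. This is a hedged illustration, not the exact upstream patch: `master`, `is_chief`, and `train_dir` stand in for values trainer.py already has in scope, and the toy loss exists only to make the snippet self-contained. It is written against `tf.compat.v1` so it also runs under TF 2.x; on the 1.5 runtime this was plain `import tensorflow as tf`.

```python
import os
import tensorflow.compat.v1 as tf

tf.disable_eager_execution()

master = ''                          # Cloud ML supplies the real master address
is_chief = True                      # true on the master replica (task index 0)
train_dir = '/tmp/od_train_sketch'   # gs://test-bucket/data/ in the real job

# Toy model standing in for the detection graph trainer.py builds.
global_step = tf.train.get_or_create_global_step()
weight = tf.Variable(3.0)
loss = tf.square(weight - 1.0)
train_op = tf.train.GradientDescentOptimizer(0.1).minimize(
    loss, global_step=global_step)

# Replacement for: slim.learning.train(train_op, train_dir, master=master, ...)
# MonitoredTrainingSession handles session creation, checkpointing, and
# chief/worker coordination without the deprecated tf.train.Supervisor.
with tf.train.MonitoredTrainingSession(
        master=master,
        is_chief=is_chief,
        checkpoint_dir=train_dir,
        hooks=[tf.train.StopAtStepHook(last_step=5)]) as sess:
    while not sess.should_stop():
        sess.run(train_op)

saved = os.path.exists(os.path.join(train_dir, 'checkpoint'))
```

With `checkpoint_dir` set and `is_chief` true, the session writes checkpoints automatically, which is the behavior slim.learning.train's `logdir` argument provided.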