
Tensorflow Cloud ML Object Detection - Errors with Distributed Training

I'm trying to follow Tensorflow's Object Detection tutorial for distributed training of my own model, but I'm using the code exactly as it is in the repository.

I've made a couple of changes from the tutorial, notably using runtime version 1.5 instead of the 1.2 the tutorial specifies. There aren't any explicit errors (that I can see) when I run on Google Cloud ML, but the job quickly exits without training.

Here's the command I use to start the training job:

gcloud ml-engine jobs submit training object_detection_`date +%s` \
    --job-dir=gs://test-bucket/training/ \
    --packages dist/object_detection-0.1.tar.gz,slim/dist/slim-0.1.tar.gz \
    --module-name object_detection.train \
    --region us-central1 \
    --config ./config.yaml \
    -- \
    --train_dir=gs://test-bucket/data/ \
    --pipeline_config_path=gs://test-bucket/configs/ssd_inception_v2_coco.config

And this is my config.yaml:

trainingInput:
  runtimeVersion: "1.5"
  scaleTier: CUSTOM
  masterType: complex_model_l
  workerCount: 9
  workerType: standard_gpu
  parameterServerCount: 3
  parameterServerType: large_model

And finally, the logs from the end of my job:

I  worker-replica-6 Clean up finished.  worker-replica-6
I  worker-replica-7 Signal 15 (SIGTERM) was caught. Terminated by service. This is normal behavior.  worker-replica-7
I  worker-replica-7 Module completed; cleaning up.  worker-replica-7
I  worker-replica-7 Clean up finished.  worker-replica-7
I  worker-replica-8 Signal 15 (SIGTERM) was caught. Terminated by service. This is normal behavior.  worker-replica-8
I  worker-replica-8 Module completed; cleaning up.  worker-replica-8
I  worker-replica-8 Clean up finished.  worker-replica-8
I  worker-replica-1 CreateSession still waiting for response from worker: /job:master/replica:0/task:0  worker-replica-1
I  worker-replica-1 Signal 15 (SIGTERM) was caught. Terminated by service. This is normal behavior.  worker-replica-1
I  worker-replica-1 Module completed; cleaning up.  worker-replica-1
I  worker-replica-1 Clean up finished.  worker-replica-1
I  worker-replica-7 CreateSession still waiting for response from worker: /job:master/replica:0/task:0  worker-replica-7
I  worker-replica-8 CreateSession still waiting for response from worker: /job:master/replica:0/task:0  worker-replica-8
I  worker-replica-6 CreateSession still waiting for response from worker: /job:master/replica:0/task:0  worker-replica-6
I  worker-replica-3 CreateSession still waiting for response from worker: /job:master/replica:0/task:0  worker-replica-3
I  worker-replica-0 CreateSession still waiting for response from worker: /job:master/replica:0/task:0  worker-replica-0
I  worker-replica-2 CreateSession still waiting for response from worker: /job:master/replica:0/task:0  worker-replica-2
I  worker-replica-5 CreateSession still waiting for response from worker: /job:master/replica:0/task:0  worker-replica-5
I  worker-replica-1 CreateSession still waiting for response from worker: /job:master/replica:0/task:0  worker-replica-1
I  worker-replica-7 CreateSession still waiting for response from worker: /job:master/replica:0/task:0  worker-replica-7
I  worker-replica-8 CreateSession still waiting for response from worker: /job:master/replica:0/task:0  worker-replica-8
I  worker-replica-6 CreateSession still waiting for response from worker: /job:master/replica:0/task:0  worker-replica-6
I  worker-replica-3 CreateSession still waiting for response from worker: /job:master/replica:0/task:0  worker-replica-3
I  worker-replica-0 CreateSession still waiting for response from worker: /job:master/replica:0/task:0  worker-replica-0
I  worker-replica-2 CreateSession still waiting for response from worker: /job:master/replica:0/task:0  worker-replica-2
I  worker-replica-5 CreateSession still waiting for response from worker: /job:master/replica:0/task:0  worker-replica-5
I  worker-replica-1 CreateSession still waiting for response from worker: /job:master/replica:0/task:0  worker-replica-1
I  worker-replica-7 CreateSession still waiting for response from worker: /job:master/replica:0/task:0  worker-replica-7
I  worker-replica-8 CreateSession still waiting for response from worker: /job:master/replica:0/task:0  worker-replica-8
I  worker-replica-6 CreateSession still waiting for response from worker: /job:master/replica:0/task:0  worker-replica-6
I  Finished tearing down TensorFlow. 
I  Job failed.

As I mentioned, I haven't been able to get anything useful from the logs. A bit further up I get this error: Master init: Unavailable: Stream removed, but I'm unsure how to handle it. Thanks for any push in the right direction!

I reproduced your issue. I fixed it by following this:

roysheffi commented on this issue 3 months ago: Hi @pkulzc, I think I may have a lead:

On line 357, object_detection/trainer.py calls tf.contrib.slim.learning.train(), which uses the deprecated tf.train.Supervisor and should be migrated to tf.train.MonitoredTrainingSession instead, as documented in tf.train.Supervisor.

This is already requested in tensorflow/tensorflow#15793 and is reported as a solution to tensorflow/tensorflow#17852 in the last comment of yahoo/TensorFlowOnSpark#245.

So, in the end, I did this inside trainer.py:

  • Put tf.train.MonitoredTrainingSession( in place of slim.learning.train( (see the sketch below)
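
For reference, here is a minimal sketch of what that swap could look like. It assumes train_tensor is the train op and that train_dir, master, is_chief, and session_config are already in scope at the slim.learning.train() call site; these variable names are illustrative and not verified against the exact version of trainer.py:

import tensorflow as tf

# Original call (drives the loop through the deprecated tf.train.Supervisor):
# slim.learning.train(
#     train_tensor,
#     logdir=train_dir,
#     master=master,
#     is_chief=is_chief,
#     session_config=session_config)

# Possible replacement built on tf.train.MonitoredTrainingSession:
with tf.train.MonitoredTrainingSession(
        master=master,              # same gRPC target the replica already uses
        is_chief=is_chief,          # only the chief writes checkpoints/summaries
        checkpoint_dir=train_dir,   # write checkpoints where --train_dir points
        config=session_config,
        save_checkpoint_secs=600) as sess:
    while not sess.should_stop():
        sess.run(train_tensor)      # one training step per iteration

MonitoredTrainingSession takes over session creation, variable initialization, recovery, and checkpoint saving itself, which is why it is the documented replacement for the Supervisor-based loop that slim.learning.train() still uses.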
