AWS SageMaker training job (TensorFlow) halts at Epoch 1

I am trying to train Mask R-CNN on a custom dataset. The code runs fine on my local machine in the same Docker container; however, it gets stuck at the first epoch when I run it on AWS SageMaker.

Error log for the training job, as seen in the SageMaker notebook.

I am using TensorFlow 2 with the GitHub code provided at https://github.com/simone-viozzi/Mask-RCNN-training-with-docker-containers-on-Sagemaker

As Gili mentioned in the comments, you can try the example he pointed out, or report the issue to the developer at https://github.com/simone-viozzi/Mask-RCNN-training-with-docker-containers-on-Sagemaker/issues
