AWS SageMaker - Custom Training Job not saving Model output

I'm running a training job using AWS SageMaker, and I'm using a custom Estimator based on an available Docker image from AWS. I wanted to get some feedback on whether my process is correct or not prior to deployment.

I'm running the training job in a Docker container using 'local' mode in a SageMaker notebook instance, and the training job runs successfully. However, after the job completes and saves the model to opt/model/models within the Docker image, once the Docker container exits, the model saved from training is lost. Ideally, I'd like to use the model for inference, but I'm not sure about the best way of doing it. I have also tried the training job after pushing the image to ECR, but the same thing happens.
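For reference, the setup looks roughly like this; the image URI and IAM role below are placeholders rather than my real values:

```python
# A minimal sketch of the local-mode training job described above.
# The image URI and role ARN are placeholders, not real values.
from sagemaker.estimator import Estimator

estimator = Estimator(
    image_uri="<account>.dkr.ecr.<region>.amazonaws.com/my-training-image:latest",  # placeholder ECR image
    role="arn:aws:iam::<account>:role/MySageMakerRole",  # placeholder execution role
    instance_count=1,
    instance_type="local",  # 'local' runs the container on the notebook instance itself
)

estimator.fit()  # the training container starts, trains, and then exits
```

With instance_type set to "local", the SDK runs the same container via Docker on the notebook instance, which matches the behaviour described above.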

It is my understanding that the Docker state is lost once the container exits. As such, is it possible to persist the model that was produced during training in the image? One option I have thought about is saving the model output to an S3 bucket once the training job is complete, then pulling that model into another Docker image for inference. Is this expected behaviour and the correct way of doing it?
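To make that option concrete, the hand-off I have in mind would look something like this; the bucket name, object keys, and file names are hypothetical:

```python
# A rough sketch of the manual S3 hand-off considered above.
# Bucket name, object keys, and local paths are all hypothetical.
import boto3

s3 = boto3.client("s3")

# At the end of training, before the container exits:
# push the model artifact out of the container to S3.
s3.upload_file("/opt/ml/model/model.joblib", "my-model-bucket", "models/model.joblib")

# Later, inside the inference container:
# pull the same artifact back down before serving.
s3.download_file("my-model-bucket", "models/model.joblib", "/opt/ml/model/model.joblib")
```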

I am fairly new to using SageMaker, but I'd like to do things according to best practices. I've looked at a lot of the AWS documentation and followed the tutorials, but they don't seem to mention explicitly whether this is how it should be done.

Thanks for any feedback on this.

You can refer to Rok's comment on saving a model file when you're using a custom estimator. That said, SageMaker built-in estimators save the model artifacts to S3 automatically. To make inferences using that model, you can either use a real-time inference endpoint for real-time predictions, or a batch transformer to run inferences in batch mode. In both cases, you'll have to point the configuration to the container for inference and to the model artifacts. The amazon-sagemaker-examples repository has examples for common frameworks; in particular, the scikit-learn example has detailed explanations.
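As a rough illustration of pointing the configuration at the inference container and the artifacts, one way with the SageMaker Python SDK might look like this; the image URI, S3 path, and role ARN are placeholders, not values from your setup:

```python
# A minimal sketch, assuming the model artifacts were uploaded to S3 as
# model.tar.gz. Image URI, model_data path, and role ARN are placeholders.
from sagemaker.model import Model

model = Model(
    image_uri="<account>.dkr.ecr.<region>.amazonaws.com/my-inference-image:latest",  # placeholder
    model_data="s3://my-bucket/output/model.tar.gz",  # placeholder artifact location
    role="arn:aws:iam::<account>:role/MySageMakerRole",  # placeholder
)

# Option 1: real-time inference endpoint.
predictor = model.deploy(initial_instance_count=1, instance_type="ml.m5.large")

# Option 2: batch transform over a dataset in S3.
transformer = model.transformer(instance_count=1, instance_type="ml.m5.large")
transformer.transform("s3://my-bucket/batch-input/", content_type="text/csv")
```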

Also, make sure the model is being saved to /opt/ml/model/, not opt/model/models as mentioned in your question.
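Inside the training script, that usually comes down to writing to that directory (or to the SM_MODEL_DIR environment variable, which the SageMaker training toolkit sets to /opt/ml/model). A minimal sketch, with joblib and the file name as illustrative assumptions:

```python
# A minimal sketch of saving a model where SageMaker expects it.
# The joblib serializer and the file name are illustrative assumptions.
import os
import joblib

def save_model(model):
    # SM_MODEL_DIR is set to /opt/ml/model by the SageMaker training toolkit;
    # everything written there is packaged as model.tar.gz and uploaded to S3
    # when the training job completes.
    model_dir = os.environ.get("SM_MODEL_DIR", "/opt/ml/model")
    joblib.dump(model, os.path.join(model_dir, "model.joblib"))
```

Reading the directory from SM_MODEL_DIR with a /opt/ml/model fallback keeps the same script working both in local mode and on managed training instances.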
