How to prepare a Docker image for a SageMaker training job

Let's say I have a Docker image that contains the training code for a machine learning model. How can I adapt it so that it runs as a SageMaker Training Job?

There are several things to keep in mind when adapting a Docker image for a SageMaker Training Job:

  1. Training code

Training code should be located in /opt/ml/code/ in the Docker image, and the main script should be /opt/ml/code/train. The script must be executable: chmod +x /opt/ml/code/train is enough (chmod 777 /opt/ml/code/train also does the trick, but grants more permissions than needed). By default SageMaker starts training with docker run image train, so it is useful, though not critical, to add the code directory to the path with export PATH="/opt/ml/code:${PATH}" so that train can be found by name. Note that PATH only affects how executables are located; if a Python script imports modules from that directory, the directory may also need to be on PYTHONPATH. Alternatively, you can use a custom entrypoint, which lets you keep the original location of the training code.
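
As an illustration, the body of a Python train script could be wrapped as in the sketch below. The run helper and its parameters are hypothetical names. The /opt/ml/output/failure path is an assumption based on my understanding of SageMaker's container layout and is not mentioned above, so check it against the documentation before relying on it.

```python
import sys
import traceback
from pathlib import Path

CODE_DIR = "/opt/ml/code"                      # code location used in this answer
FAILURE_FILE = Path("/opt/ml/output/failure")  # assumption: failure-reason file


def run(train_fn, failure_file=FAILURE_FILE, code_dir=CODE_DIR):
    """Run train_fn and return a process exit code (0 = success, 1 = failure)."""
    # Make modules that live in the code directory importable.
    if code_dir not in sys.path:
        sys.path.insert(0, code_dir)
    try:
        train_fn()
        return 0
    except Exception:
        # Record the traceback so the failure reason is not lost.
        failure_file = Path(failure_file)
        failure_file.parent.mkdir(parents=True, exist_ok=True)
        failure_file.write_text(traceback.format_exc())
        return 1
```

In the real train file this would sit below a #!/usr/bin/env python3 line and end with sys.exit(run(my_training_function)), where my_training_function stands for your own code. A non-zero exit code is what signals a failed run.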

  2. Hyperparameters

Hyperparameters passed in the HyperParameters parameter of the training job are saved as a JSON file at /opt/ml/input/config/hyperparameters.json, so your training code has to read them from there. Keep in mind that only string values are supported, with the following limits:

  • Map entries: minimum of 0 items, maximum of 100 items.
  • Key length: maximum of 256 characters.
  • Value length: maximum of 2500 characters.
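
Because every value arrives as a string, the training code has to convert values itself. A minimal sketch of reading the file follows. The helper names load_hyperparameters and get_param are hypothetical, and returning an empty dict when the file is missing is a defensive assumption, not documented behaviour.

```python
import json
from pathlib import Path

HYPERPARAMETERS_FILE = "/opt/ml/input/config/hyperparameters.json"


def load_hyperparameters(path=HYPERPARAMETERS_FILE):
    """Return the hyperparameters as a dict of strings ({} if the file is missing)."""
    file = Path(path)
    if not file.exists():
        return {}
    return json.loads(file.read_text())


def get_param(params, name, cast, default):
    """Convert one string value with `cast`, falling back to `default`."""
    raw = params.get(name)
    return default if raw is None else cast(raw)
```

Typical use inside the training script would be params = load_hyperparameters() followed by epochs = get_param(params, "epochs", int, 10), where "epochs" is just an example key.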

  3. Output

To save the model, write all model files to the /opt/ml/model/ folder. SageMaker packs everything in this folder into a compressed TAR archive and uploads it to S3 at the output location specified in the training job.
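
A rough sketch of the saving step, assuming the model can be pickled. The function name save_model and the file names model.pkl and metadata.json are illustrative choices, not SageMaker requirements; any files placed in the folder are collected.

```python
import json
import pickle
from pathlib import Path

MODEL_DIR = "/opt/ml/model"


def save_model(model, metadata, model_dir=MODEL_DIR):
    """Write the model and a small metadata file; return the file names written."""
    out = Path(model_dir)
    out.mkdir(parents=True, exist_ok=True)
    # Serialize the model object itself.
    with open(out / "model.pkl", "wb") as f:
        pickle.dump(model, f)
    # Keep human-readable details next to it.
    (out / "metadata.json").write_text(json.dumps(metadata))
    return sorted(p.name for p in out.iterdir())
```

Frameworks usually ship their own save functions; the only part that matters for SageMaker is that the files end up under /opt/ml/model/ before the script exits.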

  4. [Optional] Metrics

Metrics are gathered by running regular expressions over the logs produced by the container. The expressions are set in the metric definitions of the training job, so you can match log lines you already emit (e.g. the loss after each epoch with val_loss: (.*) ). The minimum needed for a metric-friendly log is: log f"Value = {value}", use the regex Value = (.*), and add the metric definition {"Name": "value", "Regex": "Value = (.*)"}.
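
The pairing of log line and regex can be checked locally before submitting a job. In the sketch below, log_metric and extract_metrics are hypothetical helpers. extract_metrics only imitates the matching with Python's re module and is not how SageMaker itself is implemented.

```python
import re

# Same definition as in the text above.
METRIC_DEFINITIONS = [{"Name": "value", "Regex": "Value = (.*)"}]


def log_metric(value):
    """Print a metric line in the format the regex expects and return it."""
    line = f"Value = {value}"
    print(line, flush=True)  # flush so the line reaches the container log promptly
    return line


def extract_metrics(log_line, definitions=METRIC_DEFINITIONS):
    """Apply each metric regex to one log line; return {name: captured text}."""
    found = {}
    for definition in definitions:
        match = re.search(definition["Regex"], log_line)
        if match:
            found[definition["Name"]] = match.group(1)
    return found
```

Since (.*) captures everything up to the end of the line, a tighter pattern such as Value = ([0-9\.]+) may be safer when other text follows the number.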
