How to prepare a Docker image for a SageMaker training job
Let's say I have a Docker image that contains the training code for a machine learning model. How can I adapt it for a SageMaker training job so that I can run the image there?
There are several things to keep in mind when adapting a Docker image for a SageMaker training job:
Training code should be located in /opt/ml/code/ in the Docker image, and the main script should be /opt/ml/code/train. The script must also be executable (chmod 777 /opt/ml/code/train does the trick, although chmod +x is enough to set the execute bit). Not critical, but useful: if you need to do any imports, you may need to add the code directory to the path with export PATH="/opt/ml/code:${PATH}". Putting /opt/ml/code on PATH is also what lets the train command be found by name; for Python module imports the variable to extend is PYTHONPATH. By default SageMaker starts training with docker run image train, but alternatively you can use a custom entrypoint so that you can keep the original path of the training code.
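As a rough illustration of what the executable /opt/ml/code/train file could contain, here is a minimal sketch of a Python entry script. The names main and run_training are hypothetical placeholders, not part of any SageMaker API. The sketch assumes only that the container's exit code signals success or failure, so it returns 0 when training finishes and 1 when it raises.

```python
#!/usr/bin/env python3
# Sketch of an entry script that could be saved as /opt/ml/code/train
# (no .py extension) and made executable. All function names are hypothetical.
import sys
import traceback
from typing import Callable


def run_training() -> None:
    """Placeholder for the real training logic."""
    print("training started")


def main(train_fn: Callable[[], None] = run_training) -> int:
    """Run the training function and translate the outcome into an exit code."""
    try:
        train_fn()
    except Exception:
        # Print the traceback so it shows up in the container logs.
        traceback.print_exc()
        return 1
    return 0

# In the real script, the last line would be: sys.exit(main())
```

The shebang on the first line is what allows the file to be run directly as train once the execute bit is set.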
Hyperparameters passed in the HyperParameters parameter of the training job are saved as a JSON file at /opt/ml/input/config/hyperparameters.json, so your training code has to read them from there. Keep in mind that only string values are supported and that the HyperParameters field has limits.
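Reading the file could look roughly like the sketch below. The file path comes from the text above; the helper names and the hyperparameter keys epochs and learning_rate (with their defaults) are assumptions made up for illustration. Because every value arrives as a string, the code casts each one explicitly.

```python
import json
from pathlib import Path

# Location given in the text above.
HYPERPARAMETERS_PATH = "/opt/ml/input/config/hyperparameters.json"


def load_hyperparameters(path: str = HYPERPARAMETERS_PATH) -> dict:
    """Return the raw hyperparameters; an absent file yields an empty dict."""
    file = Path(path)
    if not file.exists():
        return {}
    with file.open() as fh:
        return json.load(fh)


def parse_hyperparameters(raw: dict) -> dict:
    """Cast string values to the types the training code expects.

    The keys and default values here are hypothetical examples.
    """
    return {
        "epochs": int(raw.get("epochs", "10")),
        "learning_rate": float(raw.get("learning_rate", "0.001")),
    }
```

Inside the training script this would be used as params = parse_hyperparameters(load_hyperparameters()). Falling back to defaults when the file is missing makes the same script runnable outside SageMaker.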
To save the model, put all of its files into the /opt/ml/model/ folder. SageMaker compresses everything in that folder into a TAR archive and saves it to S3 at the location specified in the training job.
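A minimal sketch of the saving step follows. The directory comes from the text above; the function name save_model, the file name model.json, and the use of a plain dict as a stand-in for model weights are all assumptions for illustration. A real script would write whatever format its framework produces.

```python
import json
from pathlib import Path

# Directory given in the text above.
MODEL_DIR = "/opt/ml/model"


def save_model(weights: dict, model_dir: str = MODEL_DIR) -> Path:
    """Write a stand-in model file into the model directory and return its path."""
    out_dir = Path(model_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    out_file = out_dir / "model.json"  # hypothetical file name
    out_file.write_text(json.dumps(weights))
    return out_file
```

Keeping the directory as a parameter makes it easy to point the function at a temporary folder when testing locally.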
Metrics are gathered by running regular expressions on the logs produced by the container. They are defined in the metric definitions of the training job, where you can provide regular expressions for existing logs (e.g. the loss after each epoch with val_loss: (.*)). The minimum needed for a metric-friendly log is: log f"Value = {value}", use the regex Value = (.*), and add the metric definition {"Name": "value", "Regex": "Value = (.*)"}.
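To check locally that a log line and its regex agree, something like the sketch below can help. The metric definition is the one from the text above; format_metric and extract_metric are hypothetical helpers, and extract_metric only imitates the matching on a single line. It is not SageMaker's own implementation.

```python
import re
from typing import Optional

# Metric definition taken from the text above.
METRIC_DEFINITION = {"Name": "value", "Regex": "Value = (.*)"}


def format_metric(value: float) -> str:
    """Build the log line that the training code would print."""
    return f"Value = {value}"


def extract_metric(log_line: str, definition: dict = METRIC_DEFINITION) -> Optional[float]:
    """Apply the definition's regex to one log line; None means no match."""
    match = re.search(definition["Regex"], log_line)
    if match is None:
        return None
    return float(match.group(1))
```

In the training loop this amounts to print(format_metric(loss)). Note that (.*) captures everything to the end of the line, so a log line with trailing text after the number would need a narrower pattern.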
Important documentation: