Let's say I have a docker image that contains the training code for a machine learning model. How can I adapt it so I can run it as a SageMaker Training Job?
There are several things to keep in mind when adapting a docker image for a SageMaker Training Job:
The training code should be located in /opt/ml/code/ inside the docker image, and the main script should be /opt/ml/code/train. The script must be executable (chmod +x /opt/ml/code/train does the trick). Also not critical, but useful: if your script imports other modules, you may need to add the code directory to the path, e.g. export PATH="/opt/ml/code:${PATH}" (for Python imports, PYTHONPATH is the relevant variable). By default SageMaker effectively runs docker run image train, but alternatively you can configure a custom entrypoint so that you can keep the original location of the training code.
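As a sketch, a Dockerfile following the layout above might look like this (the base image, script names, and package directory are placeholders for your own code):

```dockerfile
# Hypothetical base image that already has your ML dependencies installed
FROM python:3.11-slim

# Put the training code where SageMaker expects it; the main script is
# named "train" (no extension), so it needs a shebang line inside it
COPY train.py /opt/ml/code/train
COPY my_package/ /opt/ml/code/my_package/

# The main script must be executable
RUN chmod +x /opt/ml/code/train

# Make the script discoverable and helper modules importable
ENV PATH="/opt/ml/code:${PATH}"
ENV PYTHONPATH="/opt/ml/code"
```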
Hyperparameters provided via the HyperParameters parameter of the training job are saved as a JSON file at /opt/ml/input/config/hyperparameters.json, so your training code has to read them from there. Keep in mind that all values are passed as strings (your code must cast them to the right types) and that they are subject to count and size limits.
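A minimal sketch of reading the hyperparameters inside the container; the file path is the standard SageMaker location, while the fallback defaults and casting are just one way to handle it:

```python
import json
import os

# Standard location where SageMaker writes the HyperParameters dict
HYPERPARAMS_PATH = "/opt/ml/input/config/hyperparameters.json"

def load_hyperparameters(path=HYPERPARAMS_PATH):
    """Read SageMaker hyperparameters; every value arrives as a string."""
    if not os.path.exists(path):
        # Allows running the same code locally, outside SageMaker
        return {}
    with open(path) as f:
        return json.load(f)

hp = load_hyperparameters()
# All values are strings, so cast explicitly to the types you need
learning_rate = float(hp.get("learning_rate", "0.001"))
epochs = int(hp.get("epochs", "10"))
```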
To save the model, put all its files into the /opt/ml/model/ folder: SageMaker compresses everything in that folder into a tar archive and uploads it to the S3 location specified in the training job.
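A sketch of the saving step; the local fallback directory and pickle format are illustrative choices, the only fixed part is the /opt/ml/model/ destination inside the container:

```python
import os
import pickle

# SageMaker packages everything under /opt/ml/model/ into a tar archive
# and uploads it to S3 when the job finishes; fall back to a local
# directory so the same code runs outside the container
MODEL_DIR = "/opt/ml/model" if os.path.isdir("/opt/ml/model") else "./model"

def save_model(model, filename="model.pkl"):
    os.makedirs(MODEL_DIR, exist_ok=True)
    path = os.path.join(MODEL_DIR, filename)
    with open(path, "wb") as f:
        pickle.dump(model, f)
    return path

# Any serializable object works; a real job would save trained weights here
saved_path = save_model({"weights": [0.1, 0.2]})
```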
Metrics are gathered by running regular expressions over the logs produced by the container. These are defined in the metric definitions of the training job, where you provide a regular expression for each log pattern (e.g. loss after each epoch with val_loss: (.*)). The minimal setup for a metric-friendly log is: log f"Value = {value}", with the corresponding regex Value = (.*) and the corresponding metric definition {"Name": "value", "Regex": "Value = (.*)"}.
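This minimal example can be checked locally by applying the same regex SageMaker would run against the log line (the metric name "value" is the hypothetical one from above):

```python
import re

# Metric definition you would pass in the training job configuration
metric_definitions = [{"Name": "value", "Regex": "Value = (.*)"}]

# In the training code, simply print the metric in the agreed format;
# SageMaker applies the regex to the container's log output
value = 0.42
log_line = f"Value = {value}"
print(log_line)

# Verify locally that the regex extracts the number from the log line
match = re.search(metric_definitions[0]["Regex"], log_line)
extracted = float(match.group(1))
```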
Important documentation: