Use SageMaker PyTorch image for training
I am trying to containerize the training process for a fine-tuned BERT model and run it on SageMaker. I was planning to use the pre-built SageMaker PyTorch GPU containers ( https://aws.amazon.com/releasenotes/available-deep-learning-containers-images/ ) as my starting point, but I am having issues pulling the images during my build process.
My Dockerfile looks like this:
# SageMaker PyTorch image
FROM 763104351884.dkr.ecr.us-east-1.amazonaws.com/pytorch-training:1.5.0-gpu-py36-cu101-ubuntu16.04
ENV PATH="/opt/ml/code:${PATH}"
# /opt/ml and all subdirectories are utilized by SageMaker, we use the /code subdirectory to store our user code.
COPY /bert /opt/ml/code
# this environment variable is used by the SageMaker PyTorch container to determine our user code directory.
ENV SAGEMAKER_SUBMIT_DIRECTORY /opt/ml/code
# this environment variable is used by the SageMaker PyTorch container to determine our program entry point
# for training and serving.
# For more information: https://github.com/aws/sagemaker-pytorch-container
ENV SAGEMAKER_PROGRAM bert/train
My build_and_push script:
#!/usr/bin/env bash
# This script shows how to build the Docker image and push it to ECR to be ready for use
# by SageMaker.
# The argument to this script is the image name. This will be used as the image on the local
# machine and combined with the account and region to form the repository name for ECR.
IMAGE="my-bert"
# parameters
PY_VERSION="py36"
# Get the account number associated with the current IAM credentials
account=$(aws sts get-caller-identity --query Account --output text)
if [ $? -ne 0 ]
then
exit 255
fi
chmod +x bert/train
# Get the region defined in the current configuration (default to us-east-2 if none defined)
region=$(aws configure get region)
region=${region:-us-east-2}
# If the repository doesn't exist in ECR, create it.
aws ecr describe-repositories --repository-names ${IMAGE} || aws ecr create-repository --repository-name ${IMAGE}
echo "---> repository done.."
# Get the login command from ECR and execute it directly
aws ecr get-login-password --region $region | docker login --username AWS --password-stdin $account.dkr.ecr.$region.amazonaws.com
echo "---> logged in to account ecr.."
# Get the login command from ECR in order to pull down the SageMaker PyTorch image
# aws ecr get-login-password --region $region | docker login --username AWS --password-stdin 763104351884.dkr.ecr.us-east-1.amazonaws.com
# echo "---> logged in to pytorch ecr.."
echo "Building image with arch=gpu, region=${region}"
TAG="gpu-${PY_VERSION}"
FULLNAME="${account}.dkr.ecr.${region}.amazonaws.com/${IMAGE}:${TAG}"
docker build -t ${IMAGE}:${TAG} --build-arg ARCH="$arch" -f "Dockerfile" .
docker tag ${IMAGE}:${TAG} ${FULLNAME}
docker push ${FULLNAME}
I get the following message during the push, and the SageMaker PyTorch image is not pulled:
Get https://763104351884.dkr.ecr.us-east-1.amazonaws.com/v2/pytorch-training/manifests/1.5.0-gpu-py36-cu101-ubuntu16.04: no basic auth credentials
Please let me know if this is the correct way to use a pre-built SageMaker image and what I could do to fix this error.
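You should run something like this before running docker build. Docker stores credentials per registry host, so the region in the login command has to match the region in the FROM line of the Dockerfile (us-east-1 in the question):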
aws ecr get-login-password --region ${region} | docker login --username AWS --password-stdin 763104351884.dkr.ecr.${region}.amazonaws.com
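To avoid a mismatch between the region you log in to and the region in the FROM line, the registry host and region can be derived from the image URI itself. The following is only a sketch: the function names (ecr_registry_of, ecr_region_of, ecr_login_for_image) are made up for illustration, and it assumes the usual ECR host format <account>.dkr.ecr.<region>.amazonaws.com.

```shell
#!/usr/bin/env bash
# Hypothetical helpers (names are illustrative, not part of any AWS tooling).

# Print the registry host: everything before the first "/" of the image URI.
ecr_registry_of() {
  printf '%s\n' "${1%%/*}"
}

# Print the region embedded in an ECR host of the assumed form
# <account>.dkr.ecr.<region>.amazonaws.com
ecr_region_of() {
  local host
  host="$(ecr_registry_of "$1")"
  host="${host#*.dkr.ecr.}"                 # drop "<account>.dkr.ecr."
  printf '%s\n' "${host%%.amazonaws.com*}"  # drop ".amazonaws.com" and anything after
}

# Log Docker in to the registry that hosts the given image.
# Requires the aws CLI (with get-login-password) and docker on PATH.
ecr_login_for_image() {
  local registry region
  registry="$(ecr_registry_of "$1")"
  region="$(ecr_region_of "$1")"
  aws ecr get-login-password --region "$region" |
    docker login --username AWS --password-stdin "$registry"
}
```

With these defined, calling ecr_login_for_image 763104351884.dkr.ecr.us-east-1.amazonaws.com/pytorch-training:1.5.0-gpu-py36-cu101-ubuntu16.04 before docker build would log in to the us-east-1 registry regardless of the region your own profile defaults to.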
This image (763104351884.dkr.ecr.us-east-1.amazonaws.com/pytorch-training:1.5.0-gpu-py36-cu101-ubuntu16.04) is hosted in ECR.
So when you want to pull it, make sure you have the correct AWS configuration (with your own AWS account's security tokens) and have run the ECR login command before pulling the image.
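Example (this is AWS CLI v1 syntax; get-login was removed in AWS CLI v2, and the command prints a docker login command that you then have to run):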
aws ecr get-login --no-include-email --region us-east-1 --registry-ids 763104351884
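If the same script has to work with both AWS CLI v1 and v2, one option is to branch on the major version. This is a rough sketch, not a tested recipe: aws_cli_major and login_base_registry are hypothetical names, the account ID and region are the ones from the question, and the version parsing assumes that aws --version prints a string starting with "aws-cli/<major>.<minor>.<patch>".

```shell
#!/usr/bin/env bash
# Hypothetical helpers (names are illustrative only).

# Extract the major version from `aws --version` output such as
# "aws-cli/2.4.5 Python/3.8.8 Linux/5.4.0 exe/x86_64".
aws_cli_major() {
  local v="${1#aws-cli/}"    # drop the "aws-cli/" prefix
  printf '%s\n' "${v%%.*}"   # keep what precedes the first dot
}

# Log in to the registry that hosts the SageMaker PyTorch image.
login_base_registry() {
  local account=763104351884 region=us-east-1
  local registry="${account}.dkr.ecr.${region}.amazonaws.com"
  if [ "$(aws_cli_major "$(aws --version 2>&1)")" -ge 2 ]; then
    # v2: pipe the password into docker login for the target registry.
    aws ecr get-login-password --region "$region" |
      docker login --username AWS --password-stdin "$registry"
  else
    # v1: get-login prints a ready-made `docker login` command; execute it.
    $(aws ecr get-login --no-include-email --region "$region" --registry-ids "$account")
  fi
}
```

Call login_base_registry once before docker build; the login to your own account's registry, needed for docker push, stays as it is in the question's script.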