Use SageMaker PyTorch image for training

I am trying to containerize the training process for a fine-tuned BERT model and run it on SageMaker. I planned to use the pre-built SageMaker PyTorch GPU containers ( https://aws.amazon.com/releasenotes/available-deep-learning-containers-images/ ) as my starting point, but I am having trouble pulling the image during my build process.

My Dockerfile looks like this:

# SageMaker PyTorch image
FROM 763104351884.dkr.ecr.us-east-1.amazonaws.com/pytorch-training:1.5.0-gpu-py36-cu101-ubuntu16.04


ENV PATH="/opt/ml/code:${PATH}"

# /opt/ml and all subdirectories are utilized by SageMaker, we use the /code subdirectory to store our user code.
COPY /bert /opt/ml/code

# this environment variable is used by the SageMaker PyTorch container to determine our user code directory.
ENV SAGEMAKER_SUBMIT_DIRECTORY /opt/ml/code

# this environment variable is used by the SageMaker PyTorch container to determine our program entry point
# for training and serving.
# For more information: https://github.com/aws/sagemaker-pytorch-container
ENV SAGEMAKER_PROGRAM bert/train

My build_and_push script:

#!/usr/bin/env bash

# This script shows how to build the Docker image and push it to ECR to be ready for use
# by SageMaker.

# The image name. This will be used as the image on the local
# machine and combined with the account and region to form the repository name for ECR.
IMAGE="my-bert"

# parameters
PY_VERSION="py36"

# Get the account number associated with the current IAM credentials
account=$(aws sts get-caller-identity --query Account --output text)

if [ $? -ne 0 ]
then
    exit 255
fi

chmod +x bert/train

# Get the region defined in the current configuration (default to us-east-2 if none defined)
region=$(aws configure get region)
region=${region:-us-east-2}

# If the repository doesn't exist in ECR, create it.
aws ecr describe-repositories --repository-names ${IMAGE} || aws ecr create-repository --repository-name ${IMAGE}

echo "---> repository done.."
# Get the login command from ECR and execute it directly
aws ecr get-login-password --region $region | docker login --username AWS --password-stdin $account.dkr.ecr.$region.amazonaws.com
echo "---> logged in to account ecr.."

# Get the login command from ECR in order to pull down the SageMaker PyTorch image
# aws ecr get-login-password --region $region | docker login --username AWS --password-stdin 763104351884.dkr.ecr.us-east-1.amazonaws.com
# echo "---> logged in to pytorch ecr.."

echo "Building image with arch=gpu, region=${region}"
TAG="gpu-${PY_VERSION}"
FULLNAME="${account}.dkr.ecr.${region}.amazonaws.com/${IMAGE}:${TAG}"
docker build -t ${IMAGE}:${TAG} -f "Dockerfile" .
docker tag ${IMAGE}:${TAG} ${FULLNAME}
docker push ${FULLNAME}

I get the following message during the push, and the SageMaker PyTorch image is not pulled:

Get https://763104351884.dkr.ecr.us-east-1.amazonaws.com/v2/pytorch-training/manifests/1.5.0-gpu-py36-cu101-ubuntu16.04: no basic auth credentials

Please let me know if this is the correct way to use a pre-built SageMaker image and what I could do to fix this error.

You should run something like this before running docker build:

aws ecr get-login-password --region ${region} | docker login --username AWS --password-stdin 763104351884.dkr.ecr.${region}.amazonaws.com
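
Docker stores credentials per registry host, so the host given to `docker login` has to match the host in the Dockerfile's `FROM` line exactly, region included. A minimal sketch, assuming the base image URI from the question; `registry_of`, `region_of`, and `ecr_login` are hypothetical helper names, not part of the AWS CLI or Docker:

```shell
#!/usr/bin/env bash
# Base image URI copied from the Dockerfile's FROM line in the question.
base_image="763104351884.dkr.ecr.us-east-1.amazonaws.com/pytorch-training:1.5.0-gpu-py36-cu101-ubuntu16.04"

# Registry host: everything before the first "/".
registry_of() {
    printf '%s\n' "${1%%/*}"
}

# Region: the label that follows ".dkr.ecr." in the registry host.
region_of() {
    local host
    host="$(registry_of "$1")"
    host="${host#*.dkr.ecr.}"
    printf '%s\n' "${host%%.*}"
}

# Log Docker in to the registry host $2, requesting the token in region $1.
ecr_login() {
    aws ecr get-login-password --region "$1" |
        docker login --username AWS --password-stdin "$2"
}

base_registry="$(registry_of "$base_image")"
base_region="$(region_of "$base_image")"
echo "base image registry: ${base_registry} (region ${base_region})"

# Before `docker build`, one would then call:
#   ecr_login "$base_region" "$base_registry"
```

In the script from the question, `region` may resolve to us-east-2 while the base image lives in us-east-1, so substituting `${region}` into the registry host would log in to a different host than the one the build pulls from.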

This image (763104351884.dkr.ecr.us-east-1.amazonaws.com/pytorch-training:1.5.0-gpu-py36-cu101-ubuntu16.04) is hosted in ECR.

So when you want to pull it, make sure you have a valid AWS configuration (with your own AWS account's security credentials) and have run the ECR login command before pulling the image.

Example:

aws ecr get-login --no-include-email --region us-east-1 --registry-ids 763104351884
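
Note that `get-login` exists only in AWS CLI v1 and merely prints a `docker login` command, which still has to be executed; AWS CLI v2 removed it in favour of `get-login-password`. A rough sketch that picks the form from the installed CLI's major version; `cli_major` and `ecr_login_any` are hypothetical helper names introduced here for illustration:

```shell
#!/usr/bin/env bash
# Extract the major version from a string such as "aws-cli/2.15.30 Python/3.11.8".
cli_major() {
    local v="${1#aws-cli/}"
    printf '%s\n' "${v%%.*}"
}

# Log Docker in to the ECR registry owned by account $1 in region $2.
ecr_login_any() {
    local account_id="$1" region="$2"
    if [ "$(cli_major "$(aws --version 2>&1)")" = "1" ]; then
        # v1: get-login prints a complete "docker login ..." command; execute it.
        $(aws ecr get-login --no-include-email --region "$region" --registry-ids "$account_id")
    else
        # v2: get-login-password prints only the token; pipe it to docker login.
        aws ecr get-login-password --region "$region" |
            docker login --username AWS --password-stdin "${account_id}.dkr.ecr.${region}.amazonaws.com"
    fi
}

# Usage (before docker build):
#   ecr_login_any 763104351884 us-east-1
```

The version strings used to exercise `cli_major` are made-up examples of the `aws --version` output format, not captured logs.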
