
Sagemaker Training Job Not Uploading/Saving Training Model to S3 Output Path

OK, I've been dealing with this issue in SageMaker for almost a week and I'm ready to pull my hair out. I've got a custom training script paired with a data processing script in a BYO algorithm Docker deployment type scenario. It's a PyTorch model built with Python 3.x, and the BYO Dockerfile was originally built for Python 2, but I can't see how that relates to the problem I'm having... which is that after a successful training run, SageMaker doesn't save the model to the target S3 bucket.

I've searched far and wide and can't seem to find an applicable answer anywhere. This is all done inside a Notebook instance. Note: I am using this as a contractor and don't have full permissions to the rest of AWS, including downloading the Docker image.

Dockerfile:

FROM ubuntu:18.04

MAINTAINER Amazon AI <sage-learner@amazon.com>

RUN apt-get -y update && apt-get install -y --no-install-recommends \
         wget \
         python-pip \
         python3-pip \
         nginx \
         ca-certificates \
    && rm -rf /var/lib/apt/lists/*

RUN wget https://bootstrap.pypa.io/get-pip.py && python3 get-pip.py && \
    pip3 install future numpy torch scipy scikit-learn pandas flask gevent gunicorn && \
        rm -rf /root/.cache

ENV PYTHONUNBUFFERED=TRUE
ENV PYTHONDONTWRITEBYTECODE=TRUE
ENV PATH="/opt/program:${PATH}"

COPY decision_trees /opt/program
WORKDIR /opt/program

Docker Image Build:

%%sh

algorithm_name="name-this-algo"

cd container

chmod +x decision_trees/train
chmod +x decision_trees/serve

account=$(aws sts get-caller-identity --query Account --output text)

region=$(aws configure get region)
region=${region:-us-east-2}

fullname="${account}.dkr.ecr.${region}.amazonaws.com/${algorithm_name}:latest"

aws ecr describe-repositories --repository-names "${algorithm_name}" > /dev/null 2>&1

if [ $? -ne 0 ]
then
    aws ecr create-repository --repository-name "${algorithm_name}" > /dev/null
fi

# Get the login command from ECR and execute it directly
$(aws ecr get-login --region ${region} --no-include-email)

# Build the docker image locally with the image name and then push it to ECR
# with the full name.

docker build  -t ${algorithm_name} .
docker tag ${algorithm_name} ${fullname}

docker push ${fullname}

Env setup and session start:

common_prefix = "pytorch-lstm"
training_input_prefix = common_prefix + "/training-input-data"
batch_inference_input_prefix = common_prefix + "/batch-inference-input-data"

import os
from sagemaker import get_execution_role
import sagemaker as sage

sess = sage.Session()

role = get_execution_role()
print(role)

Training Directory, Image, and Estimator Setup, then a fit call:

TRAINING_WORKDIR = "a/local/directory"

training_input = sess.upload_data(TRAINING_WORKDIR, key_prefix=training_input_prefix)
print ("Training Data Location " + training_input)

account = sess.boto_session.client('sts').get_caller_identity()['Account']
region = sess.boto_session.region_name
image = '{}.dkr.ecr.{}.amazonaws.com/image-that-works:working'.format(account, region)

tree = sage.estimator.Estimator(image,
                       role, 1, 'ml.p2.xlarge',
                       output_path="s3://sagemaker-directory-that-definitely/exists",
                       sagemaker_session=sess)

tree.fit(training_input)

The above script is working, for sure. I have print statements in my script and they are printing the expected results to the console. This runs as it's supposed to, finishes up, and says that it's deploying model artifacts when IT DEFINITELY DOES NOT.

Model Deployment:

model = tree.create_model()
predictor = tree.deploy(1, 'ml.m4.xlarge')

This throws an error that the model can't be found. A call to aws sagemaker describe-training-job shows that the training completed, but the time it took to upload the model was suspiciously fast, so obviously there's an error somewhere and it's not telling me. Hopefully it's not just uploading it into the aether.

{
    "Status": "Uploading",
    "StartTime": 1595982984.068,
    "EndTime": 1595982989.994,
    "StatusMessage": "Uploading generated training model"
},
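
(For anyone following along, the same status block can also be pulled with boto3; this is just a sketch, and the training job name is a placeholder.)

import boto3

sm = boto3.client('sagemaker')

# 'my-training-job-name' is a placeholder for the actual job name
desc = sm.describe_training_job(TrainingJobName='my-training-job-name')

print(desc['TrainingJobStatus'])
print(desc['SecondaryStatusTransitions'])          # the "Uploading" block above is one of these entries
print(desc['ModelArtifacts']['S3ModelArtifacts'])  # where SageMaker says it put the model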

Here's what I've tried so far:

  1. I've tried uploading it to a different bucket. I figured my permissions were the problem, so I pointed it to one that I knew allowed me to upload, since I had done so before with that bucket. No dice.
  2. I tried backporting the script to Python 2.x, but that caused more problems than it probably would have solved, and I don't really see how that would be the problem anyway.
  3. I made sure the Notebook's IAM role has sufficient permissions; it does have a SagemakerFullAccess policy attached.

What bothers me is that there's no error log I can see. If someone could point me to one I'd be happy, and if there's some hidden SageMaker kung fu that I don't know about, I would be forever grateful.


EDIT

The training job runs and prints to both the Jupyter cell and CloudWatch as expected. I've since lost the cell output in the notebook, but below are the last few lines from CloudWatch. The first number is the epoch and the rest are various custom model metrics.

[Screenshot: the last few CloudWatch log lines, showing the epoch number followed by custom model metrics]
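
Those lines can also be pulled back out of CloudWatch with boto3 if the notebook output is gone; just a sketch, using the standard training-job log group and a placeholder job name:

import boto3

logs = boto3.client('logs')

# '/aws/sagemaker/TrainingJobs' is the log group SageMaker training jobs write to;
# 'my-training-job-name' is a placeholder for the actual job name
resp = logs.filter_log_events(
    logGroupName='/aws/sagemaker/TrainingJobs',
    logStreamNamePrefix='my-training-job-name')

for event in resp['events']:
    print(event['message'])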

Can you verify from the training job logs that your training script is running? It doesn't look like your Docker image would respond to the command train, which is what SageMaker requires, and so I suspect that your model isn't actually getting trained/saved to /opt/ml/model.

AWS documentation about how SageMaker runs the Docker container: https://docs.aws.amazon.com/sagemaker/latest/dg/your-algorithms-training-algo-dockerfile.html

edit: summarizing from the comments below - the training script must also save the model to /opt/ml/model (the model isn't saved automatically).
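
As a minimal sketch of what that looks like at the end of a PyTorch train script (the model object and file name here are placeholders; /opt/ml/model is the directory SageMaker tars up and uploads to your output_path):

import os
import torch

MODEL_DIR = '/opt/ml/model'  # everything in here gets archived and uploaded to output_path after training

def save_model(model):
    os.makedirs(MODEL_DIR, exist_ok=True)
    torch.save(model.state_dict(), os.path.join(MODEL_DIR, 'model.pth'))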

Have you tried saving to a local file and moving it to S3? I would save it locally (to the root directory of the script) and upload it via boto3.

The SageMaker session object may not have a bucket attribute initialized. Doing it explicitly isn't much of an extra step.

import boto3

s3 = boto3.client('s3')
with open("FILE_NAME", "rb") as f:
    # the third argument (the destination key in the bucket) is required
    s3.upload_fileobj(f, "BUCKET_NAME", "DESTINATION_KEY")
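
If you'd rather resolve the bucket through the SageMaker SDK instead of hard-coding it, the session exposes a default bucket explicitly (just a sketch, assuming the same SDK used in the question):

import sagemaker

sess = sagemaker.Session()
bucket = sess.default_bucket()  # typically sagemaker-<region>-<account-id>
print(bucket)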
