
Training Process Fails At Downloading Input Data

I'm trying to run a training job using a custom Docker image that looks as follows:

FROM 683313688378.dkr.ecr.us-east-1.amazonaws.com/sagemaker-scikit-learn:0.20.0-cpu-py3

RUN pip install -U spacy boto3

RUN python -m spacy download en_core_web_sm

And I'm creating the training job using this:

import boto3
import sagemaker
from sagemaker import get_execution_role

container_uri = "..."
role = get_execution_role()

model = sagemaker.estimator.Estimator(
        image_uri=container_uri,
        role=role,
        instance_count=1,
        instance_type="ml.m4.xlarge",
        volume_size=1,
        entry_point="train.py",
        source_dir="spacy-train-custom",
        dependencies=["spacy-train-custom/configs"],
        output_path="s3://test-bucket/output/"
)

model.fit()

I also tried adding a couple of inputs but the result is the same:

Using role ...
2022-10-06 00:21:15 Starting - Starting the training job...
2022-10-06 00:21:40 Starting - Preparing the instances for trainingProfilerReport-1665015675: InProgress
.........
2022-10-06 00:23:09 Downloading - Downloading input data
2022-10-06 00:23:09 Stopping - Stopping the training job
2022-10-06 00:23:09 Stopped - Training job stopped
ProfilerReport-1665015675: Stopping
..
Job ended with status 'Stopped' rather than 'Completed'. This could mean the job timed out or stopped early for some other reason: Consider checking whether it completed as you expect.

Edit:

This is what the training script looks like:

import os
from sagemaker_training import environment
from spacy.cli.train import train

print("start training")

env = environment.Environment()
config_path = env.hyperparameters.pop("config")

overrides = {
    "paths.train": os.path.join(env.channel_input_dirs["train"], "train.spacy"),
}
overrides.update(env.hyperparameters)

use_gpu: int = 0 if env.num_gpus > 0 else -1

train(
    config_path,
    output_path=env.model_dir,
    use_gpu=use_gpu,
    overrides=overrides,
)

There are a couple of things wrong with your current approach, so there isn't a single problem to fix. Instead, I'll try to provide some guidance on how to deploy a custom container.

Using a custom Docker image via requirements.txt

The simplest way to achieve this is to extend an existing SageMaker container. Scikit-learn images are good base images to extend; however, I recommend using a requirements.txt file listing your desired packages, saved in the same directory as your training script. SageMaker will then install your requirements without you needing to change the container's Dockerfile at all. See the SageMaker docs for more information.
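For example, a source directory laid out like the following (file names are illustrative, matching the source_dir from your question) lets the framework container install the packages automatically before train.py runs:

```text
spacy-train-custom/
├── train.py           # entry point passed to the estimator
├── requirements.txt   # installed automatically before train.py runs
└── configs/
    └── config.cfg

# contents of requirements.txt:
#   spacy
#   boto3
```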

Extending a prebuilt container with environment variables and a Dockerfile

Your current Dockerfile is not complete, unless you omitted parts of it for brevity. As mentioned above, given your apparent requirements I do not recommend extending a container by modifying the Dockerfile; as you'll see, there are several extra steps involved compared to using a requirements.txt.

To extend a pre-built SageMaker image, you need to set the following environment variables in your Dockerfile:

  • SAGEMAKER_SUBMIT_DIRECTORY : The directory within the container in which the Python script for training is located.
  • SAGEMAKER_PROGRAM : The Python script that should be invoked and used as the entry point for training.

Create and Upload the Dockerfile and Python Training Scripts

Your Dockerfile should follow the structure below:

# Start from SageMaker's prebuilt Scikit-learn base image
FROM 683313688378.dkr.ecr.us-east-1.amazonaws.com/sagemaker-scikit-learn:0.20.0-cpu-py3

ENV PATH="/opt/ml/code:${PATH}"

# install your package 
RUN pip install <library>

# this environment variable is used by the SageMaker image container to determine our user code directory.
ENV SAGEMAKER_SUBMIT_DIRECTORY /opt/ml/code

# /opt/ml and all subdirectories are utilized by SageMaker, use the /code subdirectory to store your user code.
COPY train.py /opt/ml/code/train.py

# Defines train.py as script entrypoint 
ENV SAGEMAKER_PROGRAM train.py

Build the container

In the same directory as your Dockerfile, log in to Docker so you can pull the base container:

! aws ecr get-login-password --region us-east-1 | docker login --username AWS --password-stdin 683313688378.dkr.ecr.us-east-1.amazonaws.com

To build the Docker container, run the following Docker build command, including the trailing period (the build context):

! docker build -t sklearn-extended-container-test .

At this point you should be able to test the Docker container locally from your SageMaker notebook instance or SageMaker Studio.

Push the container to ECR

%%sh

# Specify an algorithm name
algorithm_name=<some name>

account=$(aws sts get-caller-identity --query Account --output text)

# Get the region defined in the current configuration (default to us-west-2 if none defined)
region=$(aws configure get region)

fullname="${account}.dkr.ecr.${region}.amazonaws.com/${algorithm_name}:latest"

# If the repository doesn't exist in ECR, create it.

aws ecr describe-repositories --repository-names "${algorithm_name}" > /dev/null 2>&1
if [ $? -ne 0 ]
then
    aws ecr create-repository --repository-name "${algorithm_name}" > /dev/null
fi

# Log into Docker
aws ecr get-login-password --region ${region} | docker login --username AWS --password-stdin ${account}.dkr.ecr.${region}.amazonaws.com

# Build the docker image locally with the image name and then push it to ECR
# with the full name.

docker build -t ${algorithm_name} .
docker tag ${algorithm_name} ${fullname}

docker push ${fullname}

After pushing the image, you can reference it as:

ecr_image='{}.dkr.ecr.{}.amazonaws.com/{}:latest'.format(account, region, algorithm_name)

For a detailed description, please refer to the SageMaker Developer Guide.
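Putting the pieces from the shell script together, the full image URI is just plain string formatting (the account, region, and algorithm name below are illustrative placeholders):

```python
# Build the full ECR image URI from the same pieces the shell script uses.
# All three values are placeholders for illustration.
account = "123456789012"
region = "us-east-1"
algorithm_name = "sklearn-extended-container-test"

ecr_image = "{}.dkr.ecr.{}.amazonaws.com/{}:latest".format(account, region, algorithm_name)
print(ecr_image)
```

You can then pass `ecr_image` as the `image_uri` of your estimator.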

Scikit-learn images and input channels

If you plan to train/fit the model with any type of data (text/csv or application/x-protobuf), I strongly suggest you read the SageMaker SKLearn Estimator docs. Note that SageMaker's implementation of this particular estimator, sagemaker.sklearn.estimator.SKLearn, builds on the same sagemaker.estimator framework you are using. You should declare your input channels and pass the data when you call the fit() method; SageMaker copies each object into the container's local filesystem so it is accessible "locally". Definitely avoid reading objects directly from S3 inside the container: this is bad practice and will cause problems.
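To illustrate how channels connect to your training script (bucket and key names below are made up): each channel you pass to fit() is downloaded to /opt/ml/input/data/<channel_name> inside the container, which is exactly the path that env.channel_input_dirs resolves to in your train.py:

```python
import os

# Channels you would pass when starting the job,
# e.g. model.fit({"train": "s3://...", "dev": "s3://..."}).
# The bucket/key names here are illustrative placeholders.
channels = {
    "train": "s3://test-bucket/data/train.spacy",
    "dev": "s3://test-bucket/data/dev.spacy",
}

# Inside the container, SageMaker downloads each channel to
# /opt/ml/input/data/<channel_name>; env.channel_input_dirs maps
# channel names to those local directories.
channel_input_dirs = {
    name: os.path.join("/opt/ml/input/data", name) for name in channels
}
print(channel_input_dirs["train"])  # /opt/ml/input/data/train
```

This is also the likely reason your job stopped right after "Downloading input data": you called fit() with no inputs, so there was no "train" channel for your script to read.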

Training.py

If you plan on deploying your model (serving it), your training script should provide these three functions:

  • input_fn : Takes request data and deserializes the data into an object for prediction.

  • predict_fn : Takes the deserialized request object and performs inference against the loaded model.

  • output_fn : Takes the result of prediction and serializes this according to the response content type.

You can also omit these and serve the model with a dedicated inference script instead of the training script. But based on your training script, your use case seems simple enough to do it all in one entry_point.

Good luck!

Since you're really running a Scikit-learn container (with light customization), try using the more high-level SKLearn estimator as shown here, instead of the generic Estimator. This might help you get it working, or at least produce more helpful error messages.
