
Sagemaker training job Fatal error: cannot open file 'train': No such file or directory

I am working on a bring-your-own-model setup with R code. When I try to run the training job, it fails with the error in the title.

Training Image:

FROM r-base:3.6.3

MAINTAINER Amazon SageMaker Examples <amazon-sagemaker-examples@amazon.com>

RUN apt-get -y update && apt-get install -y --no-install-recommends \
    wget \
    r-base \
    r-base-dev \
    apt-transport-https \
    ca-certificates \
    python3 python3-dev pip

ENV AWS_DEFAULT_REGION="us-east-2"
RUN R -e "install.packages('reticulate', dependencies = TRUE, warning = function(w) stop(w))"
RUN R -e "install.packages('readr', dependencies = TRUE, warning = function(w) stop(w))"
RUN R -e "install.packages('dplyr', dependencies = TRUE, warning = function(w) stop(w))"

RUN pip install --quiet --no-cache-dir \
    'boto3>1.0<2.0' \
    'sagemaker>2.0<3.0'    

ENTRYPOINT ["/usr/bin/Rscript"]

Source code:

rcode
    ├── train.R
    └── train.tar.gz

Build

- aws s3 cp $CODEBUILD_SRC_DIR/rcode/ s3://${self:custom.deploymentBucket}/${self:service}/code/training --recursive

Serverless.com YAML:

           SagemakerRCodeTrainingStep:
            Type: Task
            Resource: ${self:custom.sageMakerTrainingJob}
            Parameters:
              TrainingJobName.$: "$.sageMakerTrainingJobName"
              DebugHookConfig:
                S3OutputPath: "s3://${self:custom.deploymentBucket}/${self:service}/models/rmodel"
              AlgorithmSpecification:
                TrainingImage: ${self:custom.sagemakerRExecutionContainerURI}
                TrainingInputMode: "File"
              OutputDataConfig:
                S3OutputPath: "s3://${self:custom.deploymentBucket}/${self:service}/models/rmodel"
              StoppingCondition:
                MaxRuntimeInSeconds: ${self:custom.maxRuntime}
              ResourceConfig:
                InstanceCount: 1
                InstanceType: "ml.m5.xlarge"
                VolumeSizeInGB: 30
              RoleArn: ${self:custom.stateMachineRoleARN}
              InputDataConfig:
                - DataSource:
                    S3DataSource:
                      S3DataType: "S3Prefix"
                      S3Uri: "s3://${self:custom.datasetsFilePath}/data/processed/train"
                      S3DataDistributionType: "FullyReplicated"
                  ChannelName: "train"
              HyperParameters:
                sagemaker_submit_directory: "s3://${self:custom.deploymentBucket}/${self:service}/code/training/train.tar.gz"
                sagemaker_program: "train.R"
                sagemaker_enable_cloudwatch_metrics: "false"
                sagemaker_container_log_level: "20"
                sagemaker_job_name: "sagemaker-r-learn-2022-02-28-09-56-33-234"
                sagemaker_region: ${self:provider.region}

I am not sure which TrainingImage you are using or what files are in your container, but I suspect you are using a custom container.

SageMaker training jobs look for an executable named train and run your container as follows:

docker run image train

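With an ENTRYPOINT of /usr/bin/Rscript, as in the Dockerfile in the question, the train argument is appended to the entrypoint, so the container effectively runs Rscript train. Rscript then tries to open a file named train in the working directory, which produces "cannot open file 'train': No such file or directory". You can change this behavior by setting the ENTRYPOINT in your Dockerfile so that it points at your script. Please see the example Dockerfile in the r_byo_r_algo_hpo example.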
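
A minimal sketch of such an ENTRYPOINT, not taken from the referenced example verbatim. It assumes train.R is baked into the image at build time; the path /opt/ml/code/train.R is a hypothetical choice, and the sagemaker_submit_directory hyperparameter is not used here.

```dockerfile
# Assumption: train.R sits next to this Dockerfile in the build context.
# /opt/ml/code/train.R is a hypothetical destination path.
COPY train.R /opt/ml/code/train.R

# Exec-form ENTRYPOINT that names the script explicitly. SageMaker still
# appends "train", so the container runs:
#   /usr/bin/Rscript /opt/ml/code/train.R train
# and "train" reaches the script as an ordinary command-line argument
# instead of being the file Rscript tries to open.
ENTRYPOINT ["/usr/bin/Rscript", "/opt/ml/code/train.R"]
```

Inside train.R, the appended argument should be visible through commandArgs(trailingOnly = TRUE), so the script can either ignore it or use it to tell a training run apart from other modes.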
