
SageMaker training job fatal error: cannot open file 'train': No such file or directory

I am trying to bring my own model, and my code is in R. When I try to run the training job, it fails.

Training image:

FROM r-base:3.6.3

MAINTAINER Amazon SageMaker Examples <amazon-sagemaker-examples@amazon.com>

RUN apt-get -y update && apt-get install -y --no-install-recommends \
    wget \
    r-base \
    r-base-dev \
    apt-transport-https \
    ca-certificates \
    python3 python3-dev pip

ENV AWS_DEFAULT_REGION="us-east-2"
RUN R -e "install.packages('reticulate', dependencies = TRUE, warning = function(w) stop(w))"
RUN R -e "install.packages('readr', dependencies = TRUE, warning = function(w) stop(w))"
RUN R -e "install.packages('dplyr', dependencies = TRUE, warning = function(w) stop(w))"

RUN pip install --quiet --no-cache-dir \
    'boto3>1.0,<2.0' \
    'sagemaker>2.0,<3.0'

ENTRYPOINT ["/usr/bin/Rscript"]

Source code:

rcode
    ├── train.R
    └── train.tar.gz

Build:

- aws s3 cp $CODEBUILD_SRC_DIR/rcode/ s3://${self:custom.deploymentBucket}/${self:service}/code/training --recursive

serverless.com YAML:

           SagemakerRCodeTrainingStep:
            Type: Task
            Resource: ${self:custom.sageMakerTrainingJob}
            Parameters:
              TrainingJobName.$: "$.sageMakerTrainingJobName"
              DebugHookConfig:
                S3OutputPath: "s3://${self:custom.deploymentBucket}/${self:service}/models/rmodel"
              AlgorithmSpecification:
                TrainingImage: ${self:custom.sagemakerRExecutionContainerURI}
                TrainingInputMode: "File"
              OutputDataConfig:
                S3OutputPath: "s3://${self:custom.deploymentBucket}/${self:service}/models/rmodel"
              StoppingCondition:
                MaxRuntimeInSeconds: ${self:custom.maxRuntime}
              ResourceConfig:
                InstanceCount: 1
                InstanceType: "ml.m5.xlarge"
                VolumeSizeInGB: 30
              RoleArn: ${self:custom.stateMachineRoleARN}
              InputDataConfig:
                - DataSource:
                    S3DataSource:
                      S3DataType: "S3Prefix"
                      S3Uri: "s3://${self:custom.datasetsFilePath}/data/processed/train"
                      S3DataDistributionType: "FullyReplicated"
                  ChannelName: "train"
              HyperParameters:
                sagemaker_submit_directory: "s3://${self:custom.deploymentBucket}/${self:service}/code/training/train.tar.gz"
                sagemaker_program: "train.R"
                sagemaker_enable_cloudwatch_metrics: "false"
                sagemaker_container_log_level: "20"
                sagemaker_job_name: "sagemaker-r-learn-2022-02-28-09-56-33-234"
                sagemaker_region: ${self:provider.region}

I'm not sure which TrainingImage you are using or which files are in the container. That said, I suspect you are using a custom container.

A SageMaker training job looks for something runnable named train in your image and starts the container as follows:

docker run image train

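With ENTRYPOINT ["/usr/bin/Rscript"], the train argument is handed to Rscript, which then tries to open a file named train. That matches the error in the title. You can change this behavior through the ENTRYPOINT in your Dockerfile, for instance by including the script path in it. See the example Dockerfile in the r_byo_r_algo_hpo example.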
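The way the ENTRYPOINT combines with the train argument can be modeled in a few lines. This is a minimal Python sketch, not Docker's implementation: it only mirrors the rule that, for an exec-form ENTRYPOINT, the arguments given after the image name are appended to it. The function name container_argv and the path /opt/ml/code/train.R are assumptions made for illustration; the question's Dockerfile does not copy train.R into the image, so the script would also have to be added there.

```python
from typing import List, Sequence


def container_argv(entrypoint: Sequence[str], command: Sequence[str]) -> List[str]:
    """Model the argv of a container started from an exec-form ENTRYPOINT.

    Arguments passed after the image name on `docker run` are appended
    to the ENTRYPOINT list.
    """
    return list(entrypoint) + list(command)


# SageMaker starts training containers with the single argument "train".
SAGEMAKER_COMMAND = ["train"]

# ENTRYPOINT taken from the Dockerfile in the question.
question_entrypoint = ["/usr/bin/Rscript"]

# Hypothetical alternative: name the script in the ENTRYPOINT.
# The path is an assumption and must match wherever the image
# actually stores train.R.
fixed_entrypoint = ["/usr/bin/Rscript", "/opt/ml/code/train.R"]

broken = container_argv(question_entrypoint, SAGEMAKER_COMMAND)
fixed = container_argv(fixed_entrypoint, SAGEMAKER_COMMAND)

# Rscript is asked to run a file called "train".
print(" ".join(broken))
# Rscript runs train.R and "train" becomes an argument to the script.
print(" ".join(fixed))
```

In the second form, the R script can read the trailing train argument with commandArgs(trailingOnly = TRUE) if it needs to tell a training run apart from other modes.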
