SageMaker training job: Fatal error: cannot open file 'train': No such file or directory
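I'm trying to bring my own model with R code. When I run the training job, it fails with `Fatal error: cannot open file 'train': No such file or directory`.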
Training image:
```dockerfile
FROM r-base:3.6.3

LABEL maintainer="Amazon SageMaker Examples <amazon-sagemaker-examples@amazon.com>"

RUN apt-get -y update && apt-get install -y --no-install-recommends \
    wget \
    r-base \
    r-base-dev \
    apt-transport-https \
    ca-certificates \
    python3 python3-dev python3-pip

ENV AWS_DEFAULT_REGION="us-east-2"

RUN R -e "install.packages('reticulate', dependencies = TRUE, warning = function(w) stop(w))"
RUN R -e "install.packages('readr', dependencies = TRUE, warning = function(w) stop(w))"
RUN R -e "install.packages('dplyr', dependencies = TRUE, warning = function(w) stop(w))"

# Version ranges need a comma between the lower and upper bound
RUN python3 -m pip install --quiet --no-cache-dir \
    'boto3>1.0,<2.0' \
    'sagemaker>2.0,<3.0'

# SageMaker starts the container as "docker run <image> train", so "train" is
# appended to this exec-form entrypoint and the container runs "Rscript train".
ENTRYPOINT ["/usr/bin/Rscript"]
```
Source code:

```
rcode
├── train.R
└── train.tar.gz
```
Build (CodeBuild):

```yaml
- aws s3 cp $CODEBUILD_SRC_DIR/rcode/ s3://${self:custom.deploymentBucket}/${self:service}/code/training --recursive
```
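Because `sagemaker_submit_directory` has to name the exact S3 object this build step uploads, it can help to check where `aws s3 cp <dir>/ <prefix> --recursive` puts each file: the path relative to the source directory is appended to the destination prefix. A small Python sketch of that mapping (pure path arithmetic, no AWS calls); the helper name is hypothetical, and `my-bucket` / `my-service` stand in for `${self:custom.deploymentBucket}` and `${self:service}`:

```python
from __future__ import annotations

from pathlib import PurePosixPath


def s3_destinations(src_dir: str, files: list[str], dest_uri: str) -> list[str]:
    """Mimic where `aws s3 cp <src_dir>/ <dest_uri> --recursive` places files.

    Each file's path relative to src_dir is appended to the destination
    prefix. Nothing is uploaded; this only computes the target URIs.
    """
    base = PurePosixPath(src_dir)
    dest = dest_uri.rstrip("/")
    return [f"{dest}/{PurePosixPath(f).relative_to(base)}" for f in files]


for uri in s3_destinations(
    "rcode",
    ["rcode/train.R", "rcode/train.tar.gz"],
    "s3://my-bucket/my-service/code/training",
):
    print(uri)
```

With the tree above, the archive lands at `.../code/training/train.tar.gz`, the same key the `sagemaker_submit_directory` hyperparameter in the state machine definition points to.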
Serverless (serverless.com) YAML:

```yaml
SagemakerRCodeTrainingStep:
  Type: Task
  Resource: ${self:custom.sageMakerTrainingJob}
  Parameters:
    TrainingJobName.$: "$.sageMakerTrainingJobName"
    DebugHookConfig:
      S3OutputPath: "s3://${self:custom.deploymentBucket}/${self:service}/models/rmodel"
    AlgorithmSpecification:
      TrainingImage: ${self:custom.sagemakerRExecutionContainerURI}
      TrainingInputMode: "File"
    OutputDataConfig:
      S3OutputPath: "s3://${self:custom.deploymentBucket}/${self:service}/models/rmodel"
    StoppingCondition:
      MaxRuntimeInSeconds: ${self:custom.maxRuntime}
    ResourceConfig:
      InstanceCount: 1
      InstanceType: "ml.m5.xlarge"
      VolumeSizeInGB: 30
    RoleArn: ${self:custom.stateMachineRoleARN}
    InputDataConfig:
      - DataSource:
          S3DataSource:
            S3DataType: "S3Prefix"
            S3Uri: "s3://${self:custom.datasetsFilePath}/data/processed/train"
            S3DataDistributionType: "FullyReplicated"
        ChannelName: "train"
    HyperParameters:
      sagemaker_submit_directory: "s3://${self:custom.deploymentBucket}/${self:service}/code/training/train.tar.gz"
      sagemaker_program: "train.R"
      sagemaker_enable_cloudwatch_metrics: "false"
      sagemaker_container_log_level: "20"
      sagemaker_job_name: "sagemaker-r-learn-2022-02-28-09-56-33-234"
      sagemaker_region: ${self:provider.region}
```
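Inside the container, the `train` channel from `InputDataConfig` is available under `/opt/ml/input/data/train/` (File mode), and the `HyperParameters` map is written to `/opt/ml/input/config/hyperparameters.json` with every value as a string. As far as I know, `sagemaker_submit_directory` and `sagemaker_program` are only acted on by containers that run the SageMaker training toolkit; a plain `r-base` image just receives them as ordinary hyperparameters, so it will not download or unpack `train.tar.gz` by itself. A minimal Python sketch of reading those locations; the function name is hypothetical, and `prefix` is a parameter only so the sketch can be exercised outside a container:

```python
import json
from pathlib import Path


def load_training_inputs(prefix: str = "/opt/ml", channel: str = "train"):
    """Return (hyperparameters, channel_dir) as a SageMaker training container sees them.

    File mode copies each channel's S3 data into <prefix>/input/data/<channel>/,
    and the HyperParameters map is stored in
    <prefix>/input/config/hyperparameters.json (values are strings).
    """
    root = Path(prefix)
    config_file = root / "input" / "config" / "hyperparameters.json"
    hyperparameters = json.loads(config_file.read_text()) if config_file.exists() else {}

    channel_dir = root / "input" / "data" / channel
    if not channel_dir.is_dir():
        # Fail early with the path that was expected, instead of a later read error.
        raise FileNotFoundError(f"channel directory not found: {channel_dir}")
    return hyperparameters, channel_dir
```

Failing early with the expected directory in the message makes a channel-name mismatch (for example `train` vs. `training`) obvious in the CloudWatch log.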
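I'm not sure which `TrainingImage` you are using or which files are in the container. That said, I suspect you're using a custom container.

SageMaker training jobs start your container like this:

```
docker run image train
```

If the image has no `ENTRYPOINT`, Docker therefore executes a program called `train`, which must exist on the `PATH`.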
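A simplified Python model of how Docker assembles the command for an exec-form `ENTRYPOINT` (shell-form entrypoints, which ignore extra arguments, are not modeled): arguments passed to `docker run IMAGE ...` replace the image's `CMD` and are appended to the entrypoint. The helper name and the `/opt/ml/code/train.R` path are illustrative assumptions:

```python
from __future__ import annotations


def container_argv(entrypoint: list[str], cmd: list[str], run_args: list[str]) -> list[str]:
    """Return the argv Docker runs for an exec-form ENTRYPOINT (simplified).

    Arguments given to `docker run IMAGE ...` replace the image's CMD;
    whichever of the two applies is appended to ENTRYPOINT.
    """
    return entrypoint + (run_args if run_args else cmd)


# The question's image: SageMaker passes "train", so R is asked to run a file named "train".
print(container_argv(["/usr/bin/Rscript"], [], ["train"]))
# -> ['/usr/bin/Rscript', 'train']

# Hypothetical fix: the entrypoint names the script, so "train" becomes a script argument.
print(container_argv(["/usr/bin/Rscript", "/opt/ml/code/train.R"], [], ["train"]))
# -> ['/usr/bin/Rscript', '/opt/ml/code/train.R', 'train']
```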
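You can change this behavior by setting `ENTRYPOINT` in your Dockerfile, but note that the `train` argument is still appended to an exec-form entrypoint. With `ENTRYPOINT ["/usr/bin/Rscript"]`, the container runs `Rscript train`, and Rscript fails because there is no file named `train` in its working directory, which is exactly the error you're seeing. The entrypoint should name your script (which has to be copied into the image); see the example Dockerfile in the `r_byo_r_algo_hpo` example.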
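Once the entrypoint names a script, that script still receives the `train` argument (and `serve` when the image is used for a hosting endpoint), so a single script commonly branches on it; in R the arguments come from `commandArgs(trailingOnly = TRUE)`. The same dispatch pattern is sketched here in Python for illustration only; `train()` and `serve()` are placeholders, not the example's actual functions:

```python
from __future__ import annotations


def train() -> str:
    # Placeholder: real code would read /opt/ml/input/data/train and
    # write model artifacts to /opt/ml/model.
    return "training"


def serve() -> str:
    # Placeholder for an inference server (only needed for endpoints).
    return "serving"


def dispatch(argv: list[str]) -> str:
    """Run the mode named by the argument SageMaker appends ("train" or "serve")."""
    modes = {"train": train, "serve": serve}
    for arg in argv:
        if arg in modes:
            return modes[arg]()
    raise SystemExit(f"expected 'train' or 'serve' in arguments, got {argv!r}")
```

An entry script would call `dispatch(sys.argv[1:])`; scanning all arguments rather than only the first tolerates extra flags such as `--no-save` placed before the mode.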