sagemaker pytorch "Artifact upload failed"
I am running a test SageMaker PyTorch training job.
It creates the estimator and runs training successfully, but it dies at the "Uploading generated training model" step.
The error is: "Training job pytorch-training-2022-12-05-19-45-41-370 Error: Failed. Reason: ClientError: Artifact upload failed: Too many files are written"
estimator = PyTorch(  # create the estimator
    entry_point="CloudSeg.py",
    input_mode="FastFile",  # TrainingInputMode is not a PyTorch estimator argument; input_mode covers it
    role=role,
    py_version="py38",
    framework_version="1.11.0",
    instance_count=1,
    instance_type="ml.g4dn.xlarge",
    checkpoint_s3_uri=checkpoint_s3_bucket,
    checkpoint_local_path=checkpoint_local_path,
    use_spot_instances=use_spot_instances,
    max_run=max_run,
    max_wait=max_wait,
    hyperparameters={"epochs": 1, "backend": "nccl"},
)
estimator.fit({"training": "s3://bucket/DATA/"})  # fit with the training data
The fit output is:
2022-12-05 19:54:10 Training - Training image download completed. Training in progress.
2022-12-05 19:54:10 Uploading - Uploading generated training model
2022-12-05 19:54:10 Failed - Training job failed
ProfilerReport-1670269542: Stopping
UnexpectedStatusException                 Traceback (most recent call last)
/tmp/ipykernel_19821/1489485288.py in <cell line: 1>()
----> 1 estimator.fit({"training": 's3://picard-prov/38-cloud-simple-unet_DATA/'})
...
~/anaconda3/envs/pytorch_p38/lib/python3.8/site-packages/sagemaker/session.py in logs_for_job(self, job_name, wait, poll, log_type)
   3891
   3892         if wait:
-> 3893             self._check_job_status(job_name, description, "TrainingJobStatus")
   3894             if dot:
   3895                 print()
~/anaconda3/envs/pytorch_p38/lib/python3.8/site-packages/sagemaker/session.py in _check_job_status(self, job, desc, status_key_name)
   3429                 actual_status=status,
   3430             )
-> 3431         raise exceptions.UnexpectedStatusException(
   3432             message=message,
   3433             allowed_statuses=["Completed", "Stopped"],

UnexpectedStatusException: Error for Training job pytorch-training-2022-12-05-19-45-41-370: **Failed. Reason: ClientError: Artifact upload failed: Too many files are written**
Thanks!
I tried getting rid of FastFile mode. That didn't help.
After training completes, SageMaker processes the training outputs, which includes uploading any files that CloudSeg.py placed in /opt/ml/model. Check how many files you end up writing into these output folders, whose contents SageMaker uploads to S3 on your behalf (per the error message, it's too many):
/opt/ml/model
/opt/ml/output
As the last step of your algorithm, you could write code that prints out the files stored there, or use SageMaker SSH Helper to interactively inspect what is happening.
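For instance, a minimal sketch of such a check (the /opt/ml paths follow the SageMaker training-container convention; `count_files` is a hypothetical helper, not a SageMaker API) that you could run at the end of the training script:

```python
import os

def count_files(root):
    """Walk a directory tree and return (file_count, listing)."""
    listing = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            listing.append(os.path.join(dirpath, name))
    return len(listing), listing

# SageMaker uploads the contents of these directories to S3 after training.
for root in ("/opt/ml/model", "/opt/ml/output"):
    if os.path.isdir(root):
        count, files = count_files(root)
        print(f"{root}: {count} files")
        for path in files[:20]:  # show only the first few
            print("  ", path)
```

If the count in /opt/ml/model is unexpectedly large (e.g. per-epoch checkpoints landing there), that would explain the "Too many files are written" ClientError.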
I think I solved it. I changed "checkpoint_s3_bucket" to the session's default bucket, and I haven't gotten the error since.
bucket = sagemaker.Session().default_bucket()
base_job_name = "sagemaker-checkpoint-test"
checkpoint_in_bucket = "checkpoints"

# The S3 URI to store the checkpoints
checkpoint_s3_bucket = "s3://{}/{}/{}".format(bucket, base_job_name, checkpoint_in_bucket)
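Note that checkpoint_s3_uri only helps if the training script actually writes its checkpoints under checkpoint_local_path (which defaults to /opt/ml/checkpoints) rather than into /opt/ml/model, so that SageMaker syncs them to the checkpoint bucket instead of bundling them into the model artifact. A sketch of that pattern (`save_checkpoint`, the filename scheme, and the raw-bytes payload are illustrative; a real script would call torch.save here):

```python
import os

def save_checkpoint(state_bytes, epoch, checkpoint_dir="/opt/ml/checkpoints"):
    """Write one checkpoint file per epoch under the checkpoint directory.

    state_bytes stands in for serialized model state; SageMaker syncs
    files under checkpoint_local_path to checkpoint_s3_uri, keeping
    them out of the final model.tar.gz.
    """
    os.makedirs(checkpoint_dir, exist_ok=True)
    path = os.path.join(checkpoint_dir, f"checkpoint-epoch{epoch}.pt")
    with open(path, "wb") as f:
        f.write(state_bytes)
    return path
```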