SageMaker PyTorch "Artifact upload failed"

Question

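I am running a test SageMaker PyTorch training job.
The estimator is created and training runs successfully, but the job dies at the "Uploading generated training model" step.

The error is "Error for Training job pytorch-training-2022-12-05-19-45-41-370: Failed. Reason: ClientError: Artifact upload failed:Too many files are written".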

    from sagemaker.pytorch import PyTorch

    estimator = PyTorch(  # create the estimator
        entry_point="CloudSeg.py",
        input_mode="FastFile",
        TrainingInputMode='FastFile',
        role=role,
        py_version="py38",
        framework_version="1.11.0",
        instance_count=1,
        instance_type="ml.g4dn.xlarge",
        checkpoint_s3_uri=checkpoint_s3_bucket,
        checkpoint_local_path=checkpoint_local_path,
        use_spot_instances=use_spot_instances,
        max_run=max_run,
        max_wait=max_wait,
        hyperparameters={"epochs": 1, "backend": "nccl"},
    )

    estimator.fit({"training": "s3://bucket/DATA/"})  # fit with the training data

The output of fit is:

2022-12-05 19:54:10 Training - Training image download completed. Training in progress.
2022-12-05 19:54:10 Uploading - Uploading generated training model
2022-12-05 19:54:10 Failed - Training job failed
ProfilerReport-1670269542: Stopping
---------------------------------------------------------------------------
UnexpectedStatusException                 Traceback (most recent call last)
/tmp/ipykernel_19821/1489485288.py in <cell line: 1>()
----> 1 estimator.fit({"training": 's3://picard-prov/38-cloud-simple-unet_DATA/'})
...
~/anaconda3/envs/pytorch_p38/lib/python3.8/site-packages/sagemaker/session.py in logs_for_job(self, job_name, wait, poll, log_type)
3891
3892         if wait:
-> 3893             self._check_job_status(job_name, description, "TrainingJobStatus")
3894             if dot:
3895                 print()

~/anaconda3/envs/pytorch_p38/lib/python3.8/site-packages/sagemaker/session.py in _check_job_status(self, job, desc, status_key_name)
3429                     actual_status=status,
3430                 )
-> 3431             raise exceptions.UnexpectedStatusException(
3432                 message=message,
3433                 allowed_statuses=["Completed", "Stopped"],

UnexpectedStatusException: Error for Training job pytorch-training-2022-12-05-19-45-41-370: **Failed. Reason: ClientError: Artifact upload failed:Too many files are written**

Any ideas on how to solve this?

Thanks!

I tried dropping FastFile mode. It did not help.

Answer 1

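After training finishes, SageMaker processes the training output, which includes uploading the files that CloudSeg.py placed in /opt/ml/model. Check how many files you end up placing in the following output folders, which SageMaker uploads to S3 on your behalf (according to the error message, there are too many):
/opt/ml/model
/opt/ml/output
As the last step of your algorithm, you can add code that prints the files stored there, or use SageMaker SSH Helper to inspect interactively what is happening.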
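The "print out the files" suggestion could look roughly like the sketch below. It uses only the Python standard library; the helper names `summarize_output_dir` and `report_sagemaker_outputs` are hypothetical, and calling the report at the very end of CloudSeg.py is an assumption about where it would fit in your script.

```python
import os


def summarize_output_dir(root):
    """Return (file_count, total_bytes) for all files under root."""
    file_count = 0
    total_bytes = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            file_count += 1
            # Broken symlinks are counted but contribute no size.
            if os.path.isfile(path):
                total_bytes += os.path.getsize(path)
    return file_count, total_bytes


def report_sagemaker_outputs(dirs=("/opt/ml/model", "/opt/ml/output")):
    """Print a one-line summary for each output directory (hypothetical helper)."""
    for root in dirs:
        if not os.path.isdir(root):
            print(f"{root}: not present")
            continue
        count, size = summarize_output_dir(root)
        print(f"{root}: {count} files, {size} bytes")
```

Calling `report_sagemaker_outputs()` as the last statement of the training script writes the counts to the job log, so you can see which folder is accumulating files before the upload step runs.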

Answer 2

I think I solved the problem. I changed "checkpoint_s3_bucket" to use the session's default bucket. I have not gotten the error since.

    import sagemaker

    bucket = sagemaker.Session().default_bucket()
    base_job_name = "sagemaker-checkpoint-test"
    checkpoint_in_bucket = "checkpoints"
    # The S3 URI to store the checkpoints
    checkpoint_s3_bucket = "s3://{}/{}/{}".format(bucket, base_job_name, checkpoint_in_bucket)

2 answers

Answer 1
0 2022-12-06 14:00:00

Answer 2
0 Accepted 2022-12-16 14:34:20
