
Training Job is Stopping in Sagemaker

I recently switched to a different AWS account and ran into a weird error in SageMaker.

Basically, I'm just trying out the XGBoost algo on a toy dataset like this:

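import boto3
import sagemaker
from sagemaker import image_uris

# role, session and location_data are defined earlier

# Built-in XGBoost container image for the current region
xgb_image_uri = image_uris.retrieve("xgboost", boto3.Session().region_name, "1")

clf = sagemaker.estimator.Estimator(xgb_image_uri,
                   role, 1, 'ml.c4.2xlarge',
                   output_path="s3://{}/output".format(session.default_bucket()),
                   sagemaker_session=session)

# Start the training job on the input data
clf.fit(location_data)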

The training job starts, but for some reason it is stopped during the data download step, and the following message is displayed:

2021-10-21 17:33:27 Downloading - Downloading input data
2021-10-21 17:33:27 Stopping - Stopping the training job
2021-10-21 17:33:27 Stopped - Training job stopped
ProfilerReport-1634837444: Stopping
..
Job ended with status 'Stopped' rather than 'Completed'. This could mean the job timed out or stopped early for some other reason: Consider checking whether it completed as you expect.

Also, when I go back to the training jobs section and check the logs in CloudWatch, nothing is displayed. Is this a common issue, and has anyone else run into it? Are there any workarounds?

The problem most likely lies in the SageMaker template that runs before the instance is created.
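Since CloudWatch shows nothing, one way to narrow this down might be to look at what SageMaker itself recorded for the job. Below is a minimal sketch, assuming boto3 is installed and the credentials are allowed to call DescribeTrainingJob. The helper names `summarize_training_job` and `describe_job` are made up for illustration, and the job name in the usage comment is a placeholder.

```python
def summarize_training_job(description):
    """Pick the debugging-relevant fields out of a DescribeTrainingJob response."""
    return {
        "status": description.get("TrainingJobStatus"),
        "secondary_status": description.get("SecondaryStatus"),
        # FailureReason is only present when the job failed
        "failure_reason": description.get("FailureReason"),
        # Each transition records a phase and the message SageMaker logged for it
        "transitions": [
            (t.get("Status"), t.get("StatusMessage"))
            for t in description.get("SecondaryStatusTransitions", [])
        ],
    }


def describe_job(job_name, region_name=None):
    """Fetch the raw job description (hypothetical helper, needs AWS credentials)."""
    import boto3  # imported lazily so the summarizer works without AWS access

    client = boto3.client("sagemaker", region_name=region_name)
    return client.describe_training_job(TrainingJobName=job_name)


# Usage (placeholder job name, replace with the real one):
# print(summarize_training_job(describe_job("my-training-job-name")))
```

If the transitions show the job going straight from Downloading to Stopping with no failure reason, that would suggest something outside the training container requested the stop, which is consistent with the empty CloudWatch logs.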
