
AWS SageMaker Training Job not saving model output

I'm running a training job on SageMaker. The job doesn't fully complete and hits the MaxRuntimeInSeconds stopping condition. According to the documentation, the artifact should still be saved when the job is stopped. I've attached the status progression of my training job below. It looks like the training job finished correctly, but the output S3 folder is empty. Any ideas on what is going wrong here? The training data is located in the same bucket, so it should have everything it needs.

[Image: status progression of the training job]

From the status progression, it seems that the training image download completed at 15:33 UTC, and by that time the stopping condition had already been triggered by the MaxRuntimeInSeconds value you specified. From then, it took 2 minutes (15:33 to 15:35) to save any available model artifact, but in your case the training process did not happen at all. The only thing that was done was downloading the pre-built image (containing the ML algorithm). As the documentation excerpt below says, whether a model is saved depends on the state the training process is in. You could try increasing MaxRuntimeInSeconds and running the job again. Also, if you have set MaxWaitTimeInSeconds, check its value: it must be equal to or greater than MaxRuntimeInSeconds.
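As a rough sketch of the stopping-condition settings discussed above, the helper below builds the `StoppingCondition` part of a training job request and checks the MaxWaitTimeInSeconds constraint before anything is submitted. The function name `build_stopping_condition` and the numeric values are hypothetical and only for illustration; pick limits that fit your own job.

```python
def build_stopping_condition(max_runtime_seconds, max_wait_seconds=None):
    """Return a StoppingCondition dict for a training job request.

    Hypothetical helper: it only validates the two fields discussed above.
    """
    if max_runtime_seconds <= 0:
        raise ValueError("MaxRuntimeInSeconds must be positive")

    condition = {"MaxRuntimeInSeconds": max_runtime_seconds}

    if max_wait_seconds is not None:
        # MaxWaitTimeInSeconds must be equal to or greater than
        # MaxRuntimeInSeconds, otherwise the request is not valid.
        if max_wait_seconds < max_runtime_seconds:
            raise ValueError(
                "MaxWaitTimeInSeconds must be >= MaxRuntimeInSeconds"
            )
        condition["MaxWaitTimeInSeconds"] = max_wait_seconds

    return condition


# Illustrative values only: allow one hour of runtime and two hours of waiting.
stopping_condition = build_stopping_condition(3600, 7200)
```

The resulting dict is what you would pass as the `StoppingCondition` argument when creating the job, for example through boto3's `create_training_job`.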

Excerpt from the AWS documentation:

"The training algorithms provided by Amazon SageMaker automatically save the intermediate results of a model training job when possible. This attempt to save artifacts is only a best effort case as model might not be in a state from which it can be saved. For example, if training has just started, the model might not be ready to save."

If MaxRuntimeInSeconds is exceeded, the model upload is only best-effort and really depends on whether the algorithm saved any state to /opt/ml/model at all before being terminated.

The two-minute wait between 15:33 and 15:35 in the Stopping stage corresponds to the maximum time between the SIGTERM and SIGKILL signals sent to your algorithm (see the SageMaker documentation for more detail). If your algorithm traps SIGTERM, it is supposed to use it as a signal to save its work gracefully and shut down before the SageMaker platform kills it forcibly with SIGKILL 2 minutes later.
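For a custom training script, the SIGTERM handling described above might look roughly like the sketch below. It is a minimal illustration, not SageMaker's own code: `StopFlag`, `save_model`, `train`, the `model.json` file name and the fake "weights" update are all hypothetical, and the only things taken from the answer are the SIGTERM signal and the /opt/ml/model directory.

```python
import json
import os
import signal


class StopFlag:
    """Remembers that SIGTERM arrived so the training loop can exit cleanly."""

    def __init__(self):
        self.requested = False

    def handle(self, signum, frame):
        # Keep the handler tiny: just record the request.
        self.requested = True


def save_model(state, model_dir="/opt/ml/model"):
    """Write the current state into the directory that gets uploaded."""
    os.makedirs(model_dir, exist_ok=True)
    path = os.path.join(model_dir, "model.json")  # hypothetical file name
    tmp_path = path + ".tmp"
    with open(tmp_path, "w") as f:
        json.dump(state, f)
    os.replace(tmp_path, path)  # swap in the finished file in one step
    return path


def train(total_steps, stop, model_dir="/opt/ml/model"):
    """Toy training loop that stops early and still saves on SIGTERM."""
    state = {"step": 0, "weights": [0.0]}
    for step in range(1, total_steps + 1):
        if stop.requested:
            break  # SIGTERM received: stop and save within the grace period
        state["step"] = step
        state["weights"][0] += 0.1  # placeholder for a real training step
    save_model(state, model_dir)
    return state


def install_sigterm_handler(stop):
    """Register the flag's handler; call this once before training starts."""
    signal.signal(signal.SIGTERM, stop.handle)
```

In a real script you would create a `StopFlag`, call `install_sigterm_handler(stop)`, then run `train(...)`. A pre-built algorithm image does not give you this hook, so this only applies when you control the training code.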

Given that the wait in the Stopping step is exactly 2 minutes, and that the Uploading step started at 15:35 and completed almost immediately at 15:35, it's likely that your algorithm did not take advantage of the SIGTERM warning and that nothing was saved to /opt/ml/model. For a definitive answer as to whether this was indeed the case, please create a SageMaker forum post, and the SageMaker team can private-message you to gather the details of your job.
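The same reading of the timeline can be done programmatically. The sketch below assumes you have the list of secondary status transitions for the job (the `SecondaryStatusTransitions` field of a DescribeTrainingJob response, with `Status`, `StartTime` and `EndTime` entries). The helper names are hypothetical, the example list contains only the two phases whose times are quoted above, and the calendar date is a placeholder because the post gives only times.

```python
from datetime import datetime, timezone


def phase_durations(transitions):
    """Return (status, seconds) for every phase that has an end time."""
    durations = []
    for t in transitions:
        end = t.get("EndTime")
        if end is None:
            continue  # phase still in progress
        seconds = (end - t["StartTime"]).total_seconds()
        durations.append((t["Status"], seconds))
    return durations


def reached_training(transitions):
    """True if the job ever entered the Training phase."""
    return any(t["Status"] == "Training" for t in transitions)


def utc(hour, minute):
    # Placeholder date: only the times of day come from the post.
    return datetime(2020, 1, 1, hour, minute, tzinfo=timezone.utc)


# Only the phases whose times are mentioned in the answers above.
example = [
    {"Status": "Stopping", "StartTime": utc(15, 33), "EndTime": utc(15, 35)},
    {"Status": "Uploading", "StartTime": utc(15, 35), "EndTime": utc(15, 35)},
]

for status, seconds in phase_durations(example):
    print(f"{status}: {seconds:.0f} s")
print("entered Training phase:", reached_training(example))
```

A Stopping phase of about 120 seconds followed by a near-zero Uploading phase, with no Training entry at all, is the pattern the answers above interpret as "nothing was written to /opt/ml/model".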
