
SageMaker Endpoint: ServiceUnavailable 503 when calling the InvokeEndpoint operation

I've deployed a model as a SageMaker endpoint. It worked fine for some time, but now, when I invoke the model through boto3:

import boto3

client = boto3.client('sagemaker-runtime')

# payload holds the CSV-encoded request body (defined elsewhere)
response = client.invoke_endpoint(
    EndpointName="my-sagemaker-endpoint",
    ContentType="text/csv",
    Body=payload,
)

I get the following error:

ServiceUnavailable: An error occurred (ServiceUnavailable) when calling the InvokeEndpoint operation (reached max retries: 4): A transient exception occurred while retrieving variant instances. Please try again later.

The SageMaker documentation says the following about this error:

The request has failed due to a temporary failure of the server.

I've also checked the instance metrics in CloudWatch and there's nothing unusual.

I'm not sure why this error is happening. Any suggestions would be helpful.

TL;DR: The error occurs because the instance is unable to retrieve the SageMaker model artifact from S3.

Explanation

SageMaker endpoints implement a /ping route that checks whether the model artifact can be loaded within the instance. The model artifact is first retrieved from S3 and then loaded into the instance. If the model is not available on S3, the load fails and an error is reported.
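To make the health check concrete, here is a minimal sketch of what a /ping handler inside a serving container might look like. It is a simplification under stated assumptions, not the code of any particular SageMaker image: the file name model.pkl is hypothetical, and real containers usually try to load the model rather than only test that the file exists. The directory /opt/ml/model and port 8080 follow the SageMaker hosting container contract.

```python
import os
from http.server import BaseHTTPRequestHandler, HTTPServer

# SageMaker extracts the model artifact under /opt/ml/model.
# The file name below is a hypothetical example.
MODEL_PATH = "/opt/ml/model/model.pkl"


def ping_status(model_path):
    """Return the HTTP status a /ping handler would send for this model path."""
    # 200 tells the platform the container is healthy; anything else marks it unhealthy.
    return 200 if os.path.isfile(model_path) else 500


class PingHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Only /ping is handled in this sketch; every other path returns 404.
        status = ping_status(MODEL_PATH) if self.path == "/ping" else 404
        self.send_response(status)
        self.end_headers()


def serve(port=8080):
    """Start the health-check server (blocks forever)."""
    HTTPServer(("0.0.0.0", port), PingHandler).serve_forever()
```

When the artifact never reached the instance, ping_status returns 500 on every health check, which matches the failure described in this answer.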

Because the model artifact had been accidentally deleted from S3, it could not be retrieved and therefore could not be loaded. This raises the error No such file or directory when the /ping route is called to check whether the endpoint is healthy.
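One way to confirm this diagnosis is to look up the artifact location the endpoint was deployed with and check that the object still exists in S3. The sketch below assumes a standard endpoint whose production variants reference a model by name. The helper names are hypothetical; the boto3 calls used (describe_endpoint, describe_endpoint_config, describe_model, head_object) are the regular SageMaker and S3 client operations. The clients are passed in as arguments so the helpers can be exercised without AWS access.

```python
from urllib.parse import urlparse


def split_s3_url(url):
    """Split 's3://bucket/key' into (bucket, key)."""
    parsed = urlparse(url)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"not an S3 URL: {url!r}")
    return parsed.netloc, parsed.path.lstrip("/")


def model_data_urls(sm, endpoint_name):
    """Collect the ModelDataUrl of every model behind an endpoint."""
    endpoint = sm.describe_endpoint(EndpointName=endpoint_name)
    config = sm.describe_endpoint_config(
        EndpointConfigName=endpoint["EndpointConfigName"]
    )
    urls = []
    for variant in config["ProductionVariants"]:
        model_name = variant.get("ModelName")
        if not model_name:
            continue  # variant does not reference a model by name
        model = sm.describe_model(ModelName=model_name)
        # A model has either a single PrimaryContainer or a list of Containers.
        containers = model.get("Containers") or [model.get("PrimaryContainer") or {}]
        for container in containers:
            url = container.get("ModelDataUrl")
            if url:
                urls.append(url)
    return urls


def artifact_exists(s3, url):
    """Return True if the object exists, False if S3 reports it as missing."""
    bucket, key = split_s3_url(url)
    try:
        s3.head_object(Bucket=bucket, Key=key)
        return True
    except Exception as exc:
        # botocore's ClientError carries the error code in exc.response.
        code = getattr(exc, "response", {}).get("Error", {}).get("Code")
        if code in ("404", "NoSuchKey", "NotFound"):
            return False
        raise  # permission problems and other errors are not hidden
```

With real clients, sm would be boto3.client("sagemaker") and s3 would be boto3.client("s3"). Calling artifact_exists on each URL returned by model_data_urls(sm, "my-sagemaker-endpoint") then shows whether any artifact has gone missing.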


This in turn makes the load balancer assume the instance has a problem and block access to it, so when you try to invoke the endpoint you get a 503: Service Unavailable error.


Solution

The only way I resolved this was by redeploying to a new endpoint, this time taking the following into account:

  • Use at least num_instances=2 so that the instances are placed in different AZs and the LB can always reach at least one healthy instance.
  • Ensure that only specific roles have s3:PutObject permission on the S3 model artifact path models/model-name/version.
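The two points above could be sketched as follows. This is an illustration under assumptions, not the exact setup used for the redeployment. The helper names, the variant name, and the example ARNs are hypothetical. The dictionary keys match the boto3 create_endpoint_config request and the S3 bucket policy format. The policy uses a Deny statement with an ArnNotLike condition on aws:PrincipalArn, which is one common way to restrict writes to an allow-list of roles.

```python
import json


def endpoint_config_kwargs(config_name, model_name, instance_type, instance_count=2):
    """Build keyword arguments for the SageMaker create_endpoint_config call."""
    if instance_count < 2:
        raise ValueError("use at least 2 instances so one unhealthy instance "
                         "does not take the whole endpoint down")
    return {
        "EndpointConfigName": config_name,
        "ProductionVariants": [{
            "VariantName": "AllTraffic",  # hypothetical variant name
            "ModelName": model_name,
            "InstanceType": instance_type,
            "InitialInstanceCount": instance_count,
        }],
    }


def artifact_write_policy(bucket, prefix, allowed_role_arns,
                          actions=("s3:PutObject",)):
    """Build a bucket policy that denies writes under prefix to all other principals."""
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Sid": "RestrictModelArtifactWrites",
            "Effect": "Deny",
            "Principal": "*",
            "Action": list(actions),
            "Resource": f"arn:aws:s3:::{bucket}/{prefix.strip('/')}/*",
            # The deny applies to every principal whose ARN is not in the allow-list.
            "Condition": {
                "ArnNotLike": {"aws:PrincipalArn": list(allowed_role_arns)}
            },
        }],
    }
```

The first result can be passed to boto3.client("sagemaker").create_endpoint_config(**kwargs), and the second to boto3.client("s3").put_bucket_policy(Bucket=bucket, Policy=json.dumps(policy)). Since the original failure came from a deleted artifact, adding "s3:DeleteObject" to actions may also be worth considering, although the answer itself only mentions s3:PutObject.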


