SageMaker Endpoint: ServiceUnavailable 503 when calling the InvokeEndpoint operation
I've deployed a model as a SageMaker Endpoint. It worked fine for some time, but now when I invoke the model through boto3:
import boto3

client = boto3.client('sagemaker-runtime')
response = client.invoke_endpoint(
    EndpointName="my-sagemaker-endpoint",
    ContentType="text/csv",
    Body=payload,
)
I got the following error:
ServiceUnavailable: An error occurred (ServiceUnavailable) when calling the InvokeEndpoint operation (reached max retries: 4): A transient exception occurred while retrieving variant instances. Please try again later.
Looking this error up in the SageMaker documentation, I found that it states the following:
The request has failed due to a temporary failure of the server.
I've also checked the instance metrics in CloudWatch and there's nothing unusual.
I'm not sure why this error is happening; any suggestions would be helpful.
TL;DR: The error occurs because the instance is unable to retrieve the SageMaker model artifact from S3.
SageMaker Endpoints implement a /ping route which checks whether the model artifact can be loaded within the instance. The model artifact is first retrieved from S3 and then loaded into the instance. If the model is not available on S3, it shows the following error (image below).
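One way to test this hypothesis is to follow the chain from the endpoint to its model artifact and ask S3 whether the object still exists. The sketch below is a minimal illustration, not the original author's method. It assumes the model's ModelDataUrl points at a single S3 object, and the function names `model_data_urls` and `artifact_exists` are hypothetical. The clients are passed in, so you would create them with `boto3.client("sagemaker")` and `boto3.client("s3")`.

```python
from urllib.parse import urlparse


def model_data_urls(sm_client, endpoint_name):
    """Return the S3 URLs of the model artifacts behind an endpoint."""
    endpoint = sm_client.describe_endpoint(EndpointName=endpoint_name)
    config = sm_client.describe_endpoint_config(
        EndpointConfigName=endpoint["EndpointConfigName"]
    )
    urls = []
    for variant in config["ProductionVariants"]:
        model = sm_client.describe_model(ModelName=variant["ModelName"])
        # A model has either a single PrimaryContainer or a list of Containers.
        containers = model.get("Containers") or [model.get("PrimaryContainer") or {}]
        for container in containers:
            if container.get("ModelDataUrl"):
                urls.append(container["ModelDataUrl"])
    return urls


def artifact_exists(s3_client, s3_url):
    """Return True if the object at s3://bucket/key exists, False if S3 reports it missing."""
    parsed = urlparse(s3_url)
    bucket, key = parsed.netloc, parsed.path.lstrip("/")
    try:
        s3_client.head_object(Bucket=bucket, Key=key)
        return True
    except Exception as exc:
        # Assumption: botocore errors expose the error code under exc.response.
        code = (getattr(exc, "response", None) or {}).get("Error", {}).get("Code")
        if code in ("404", "NoSuchKey", "NotFound"):
            return False
        raise  # permissions or network problems are not "missing"


# Usage (requires AWS credentials):
#   import boto3
#   sm, s3 = boto3.client("sagemaker"), boto3.client("s3")
#   for url in model_data_urls(sm, "my-sagemaker-endpoint"):
#       print(url, artifact_exists(s3, url))
```

A result of False for any URL would be consistent with the deleted-artifact explanation. An access-denied error is deliberately re-raised, because it does not tell you whether the object exists.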
Because the model artifact was accidentally deleted from S3, it can't be retrieved and therefore can't be loaded, which raises the error No such file or directory when the /ping route is called to check whether the endpoint is healthy (see image below).
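To make the health check concrete, here is a minimal sketch of a /ping handler for a custom inference container. It is an illustration only and does not claim to show how any particular SageMaker container implements the route. It relies on the hosting contract that the container answers GET /ping on port 8080 and that the model artifact is extracted to /opt/ml/model. The helper name `ping_status` and the "directory is non-empty" health rule are assumptions made for the example.

```python
import os
from http.server import BaseHTTPRequestHandler, HTTPServer

MODEL_DIR = "/opt/ml/model"  # where SageMaker places the extracted model artifact


def ping_status(model_dir=MODEL_DIR):
    """Return 200 if the model directory holds at least one entry, else 500."""
    try:
        healthy = len(os.listdir(model_dir)) > 0
    except OSError:  # includes FileNotFoundError ("No such file or directory")
        healthy = False
    return 200 if healthy else 500


class PingHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Only /ping is handled in this sketch; /invocations is omitted.
        status = ping_status() if self.path == "/ping" else 404
        self.send_response(status)
        self.end_headers()


def serve(port=8080):
    """Blocking call: serve health checks until the process is stopped."""
    HTTPServer(("0.0.0.0", port), PingHandler).serve_forever()
```

With a handler like this, a missing artifact turns every health check into a non-200 response, which is the condition the next paragraph describes.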
This in turn makes the load balancer assume the instance has a problem and block access to it, so when you try to invoke the endpoint you get a 503: Service Unavailable error.
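On the client side, the 503 can be told apart from other failures by its error code, and the endpoint's reported status can be attached to the exception for easier debugging. The following is a hedged sketch: `is_service_unavailable` and `invoke_or_diagnose` are hypothetical helper names, and the code assumes the raised exception carries a botocore-style `response` dictionary.

```python
def is_service_unavailable(exc):
    """True if the exception looks like a botocore ServiceUnavailable / HTTP 503 error."""
    response = getattr(exc, "response", None) or {}
    code = response.get("Error", {}).get("Code")
    status = response.get("ResponseMetadata", {}).get("HTTPStatusCode")
    return code == "ServiceUnavailable" or status == 503


def invoke_or_diagnose(runtime_client, sm_client, endpoint_name, payload):
    """Invoke the endpoint; on a 503, re-raise with the endpoint status attached."""
    try:
        return runtime_client.invoke_endpoint(
            EndpointName=endpoint_name,
            ContentType="text/csv",
            Body=payload,
        )
    except Exception as exc:
        if not is_service_unavailable(exc):
            raise
        status = sm_client.describe_endpoint(EndpointName=endpoint_name)["EndpointStatus"]
        raise RuntimeError(
            f"Endpoint {endpoint_name!r} returned 503 (EndpointStatus={status}); "
            "check that the model artifact still exists in S3."
        ) from exc


# Usage (requires AWS credentials):
#   import boto3
#   runtime = boto3.client("sagemaker-runtime")
#   sm = boto3.client("sagemaker")
#   invoke_or_diagnose(runtime, sm, "my-sagemaker-endpoint", payload)
```

Note that the endpoint status alone may not reveal the problem, so the message also points back at the S3 artifact.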
I solved this only by redeploying to a new endpoint, this time taking the following into account:
num_instances=2, so that the instances can be spread across different AZs and the LB can communicate with at least one healthy instance.
s3:PutObject permission on the S3 model artifacts route models/model-name/version
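For the first point, a redeployment with two instances could be sketched with the boto3 SageMaker client as below. This is an assumption-laden illustration rather than the author's actual deployment code: the helper names, the variant name "AllTraffic", the "-config" suffix and the default instance type ml.m5.large are placeholders, and the model is assumed to already exist in SageMaker.

```python
def production_variant(model_name, instance_type="ml.m5.large", instance_count=2):
    """Build a ProductionVariants entry that requests at least two instances."""
    if instance_count < 2:
        raise ValueError("use at least 2 instances so one unhealthy instance is not fatal")
    return {
        "VariantName": "AllTraffic",  # placeholder variant name
        "ModelName": model_name,
        "InstanceType": instance_type,
        "InitialInstanceCount": instance_count,
    }


def create_endpoint(sm_client, endpoint_name, model_name, **variant_kwargs):
    """Create an endpoint config and an endpoint backed by a multi-instance variant."""
    config_name = f"{endpoint_name}-config"  # hypothetical naming scheme
    sm_client.create_endpoint_config(
        EndpointConfigName=config_name,
        ProductionVariants=[production_variant(model_name, **variant_kwargs)],
    )
    sm_client.create_endpoint(
        EndpointName=endpoint_name,
        EndpointConfigName=config_name,
    )
    return config_name


# Usage (requires AWS credentials and an existing SageMaker model):
#   import boto3
#   create_endpoint(boto3.client("sagemaker"), "my-sagemaker-endpoint-v2", "my-model")
```

More instances do not help if the artifact itself is missing from S3, since every instance has to load the same file, so this complements rather than replaces protecting the artifact.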