How to fix intermittent 503 Service Unavailable after idling/redeployments on AWS HTTP API Gateway & Fargate/ECS?
We've got a quite simple setup which causes us major headaches:

- An `ANY /api/{proxy+}` route to a Fargate Service/Tasks accessible via Cloud Map
- `awsvpc` container tasks exposing port 8080. No autoscaling.
- Service discovery via an `SRV` DNS record with TTL 60

We receive intermittent `HTTP 503 Service Unavailable` responses for some of our requests. A new deployment (with task redeployment) increases the rate, but even after 10-15 minutes they still occur intermittently.
In CloudWatch we see the failing 503 requests:
2020-06-05T14:19:01.810+02:00 xx.117.163.xx - - [05/Jun/2020:12:19:01 +0000] "GET ANY /api/{proxy+} HTTP/1.1" 503 33 Np24bwmwsiasJDQ=
but it seems like they do not reach a living backend instance.
We enabled VPC Flow Logs, and it seems that HTTP API Gateway wants to route some requests to stopped tasks even after they've been gone for good (far exceeding the 60 s TTL).
More puzzling: if we keep the system busy, the rate drops to nearly zero. Otherwise, after a longer period of idling, the intermittent errors seem to reoccur.
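For reference, the service-discovery side of the setup described above roughly corresponds to a Cloud Map service like the following (a CloudFormation-style sketch; all logical names such as `DiscoveryService` and `PrivateNamespace` are illustrative, not taken from the original post):

```yaml
# Sketch of the Cloud Map service used for discovery of the Fargate tasks.
DiscoveryService:
  Type: AWS::ServiceDiscovery::Service
  Properties:
    Name: api-service                # illustrative name
    DnsConfig:
      NamespaceId: !Ref PrivateNamespace
      DnsRecords:
        - Type: SRV                  # SRV record, as in the question
          TTL: 60                    # 60 s TTL -- stale entries should expire after this
      RoutingPolicy: MULTIVALUE
    HealthCheckCustomConfig:
      FailureThreshold: 1
```

With this arrangement, the HTTP API Gateway integration resolves task IPs via the namespace's DNS, which is why a stale answer outliving the 60 s TTL would produce exactly the symptoms described.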
I was facing this issue and solved it by configuring my ALB as internal instead of internet-facing (regarding its scheme). Hope it may help someone with the same issue.
Context: the environment is API Gateway + ALB (ECS).
Update: The first ALB I configured was to manage my backend services. Recently I also created another ALB (to deal with my front-end instances); in this case, I exposed a public IP instead of just a private one. This was achieved by changing the scheme to internet-facing. At first I thought this would bring back the same problem I had before, but it turned out to be something pretty simple: I just needed to add a policy allowing traffic from the internet to the ALB I created.
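The two scheme variants described above can be sketched as follows (CloudFormation-style; resource names and the security-group rule are illustrative assumptions, not the answerer's actual template):

```yaml
# Backend ALB behind API Gateway: internal scheme, private subnets.
BackendALB:
  Type: AWS::ElasticLoadBalancingV2::LoadBalancer
  Properties:
    Scheme: internal                 # instead of internet-facing
    Subnets:
      - !Ref PrivateSubnetA
      - !Ref PrivateSubnetB
    SecurityGroups:
      - !Ref BackendAlbSG

# For the second, internet-facing ALB, an ingress rule admitting
# public traffic is still needed on its security group:
FrontendAlbIngress:
  Type: AWS::EC2::SecurityGroupIngress
  Properties:
    GroupId: !Ref FrontendAlbSG
    IpProtocol: tcp
    FromPort: 443
    ToPort: 443
    CidrIp: 0.0.0.0/0                # allow traffic from the internet
```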
Though we were never able to really pinpoint the issue, we've come to the conclusion that it was caused by a combination of factors. Replacing the API Gateway with CloudFront functionality and introducing an AWS Application Load Balancer changed the method of service discovery: instead of a Route 53 zone, the ELB manages the available ECS/Fargate tasks on its own. This resolved the issue for us, along with a few other minor ones.
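The switch described above, letting the load balancer rather than a Route 53 zone track healthy tasks, hinges on the ECS service registering with an ALB target group instead of a Cloud Map service. A hedged sketch of what that looks like (CloudFormation-style; names and the health-check path are illustrative):

```yaml
# Target group the ALB uses to track healthy Fargate tasks directly.
ApiTargetGroup:
  Type: AWS::ElasticLoadBalancingV2::TargetGroup
  Properties:
    TargetType: ip                   # required for awsvpc/Fargate tasks
    Port: 8080
    Protocol: HTTP
    VpcId: !Ref Vpc
    HealthCheckPath: /api/health     # illustrative path

ApiService:
  Type: AWS::ECS::Service
  Properties:
    LaunchType: FARGATE
    # A LoadBalancers block replaces the ServiceRegistries (Cloud Map) block,
    # so ECS deregisters stopping tasks from the target group itself:
    LoadBalancers:
      - TargetGroupArn: !Ref ApiTargetGroup
        ContainerName: api
        ContainerPort: 8080
```

Because ECS drains and deregisters targets during deployments, there is no DNS TTL in the path that can serve stale task IPs.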
What worked for me was, in addition to configuring my ALB's scheme as internal as xaalves did, also putting the ALB in an isolated or private subnet. Previously I had my ALB in public subnets. bentolor's experience got me thinking that some sort of DNS resolution was going haywire, and sure enough that appeared to be the case. Now 100% of my HTTP calls complete successfully.