
How to fix intermittent 503 Service Unavailable after idling/redeployments on AWS HTTP API Gateway & Fargate/ECS?

We've got quite a simple setup which causes us major headaches:

  1. HTTP API Gateway with an S3 integration for our static HTML/JS and an ANY /api/{proxy+} route to a Fargate service/tasks accessible via Cloud Map
  2. ECS cluster with an "API service" using Fargate and a container task exposing port 8080 via awsvpc. No autoscaling. Min healthy: 100%, max: 200%.
  3. Service discovery using an SRV DNS record with TTL 60 (a minimal sketch of this wiring follows the list)
  4. The ECS service/tasks are completely bored/idling and always happy to accept requests while logging them.
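
For reference, a minimal boto3 sketch of the Cloud Map wiring described above; all names, IDs, and ARNs are placeholders, not our real values:

    import boto3

    sd = boto3.client('servicediscovery')
    ecs = boto3.client('ecs')

    # Cloud Map service with an SRV record and TTL 60, matching item 3 above
    sd_service = sd.create_service(
        Name='api',                        # placeholder service name
        NamespaceId='ns-xxxxxxxxxxxx',     # placeholder private DNS namespace
        DnsConfig={
            'RoutingPolicy': 'MULTIVALUE',
            'DnsRecords': [{'Type': 'SRV', 'TTL': 60}],
        },
    )

    # Fargate service exposing port 8080 via awsvpc, registered in Cloud Map
    ecs.create_service(
        cluster='my-cluster',              # placeholder
        serviceName='api-service',         # placeholder
        taskDefinition='api-task:1',       # placeholder
        desiredCount=2,
        launchType='FARGATE',
        networkConfiguration={'awsvpcConfiguration': {
            'subnets': ['subnet-aaaa', 'subnet-bbbb'],   # placeholders
            'securityGroups': ['sg-cccc'],               # placeholder
        }},
        serviceRegistries=[{
            'registryArn': sd_service['Service']['Arn'],
            'port': 8080,                  # required for SRV records
        }],
        deploymentConfiguration={
            'minimumHealthyPercent': 100,
            'maximumPercent': 200,
        },
    )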

Problem:

We receive intermittent HTTP 503 Service Unavailable responses for some of our requests. A new deployment (with task redeployment) increases the rate, but even after 10-15 minutes they still occur intermittently.

In CloudWatch we see the failing 503 requests:

2020-06-05T14:19:01.810+02:00 xx.117.163.xx - - [05/Jun/2020:12:19:01 +0000] "GET ANY /api/{proxy+} HTTP/1.1" 503 33 Np24bwmwsiasJDQ=

but it seems like they never reach a live backend instance.

We enabled VPC Flow Logs, and it seems that HTTP API Gateway wants to route some requests to stopped tasks even after they have been gone for good (far exceeding the 60 s TTL).
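
One way to confirm stale registrations like this is to diff the IPs Cloud Map still advertises against the IPs of tasks that are actually RUNNING. A hedged boto3 sketch, assuming the setup above (service and cluster identifiers are placeholders):

    import boto3

    sd = boto3.client('servicediscovery')
    ecs = boto3.client('ecs')

    # IPs that Cloud Map currently advertises for the service
    registered = {
        inst['Attributes'].get('AWS_INSTANCE_IPV4')
        for inst in sd.list_instances(ServiceId='srv-xxxxxxxxxxxx')['Instances']
    }

    # IPs of tasks that are actually RUNNING right now
    task_arns = ecs.list_tasks(cluster='my-cluster',
                               serviceName='api-service',
                               desiredStatus='RUNNING')['taskArns']
    running = set()
    if task_arns:
        tasks = ecs.describe_tasks(cluster='my-cluster', tasks=task_arns)['tasks']
        for task in tasks:
            for attachment in task['attachments']:
                for detail in attachment.get('details', []):
                    if detail['name'] == 'privateIPv4Address':
                        running.add(detail['value'])

    # Anything advertised but no longer running is a stale target candidate
    print('stale registrations:', registered - running)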

More puzzling: if we keep the system busy, the error rate drops to nearly zero. After a longer period of idling, however, the intermittent errors seem to recur.

Questions

  1. How can we fix this issue?
  2. Are there options to further pinpoint the root issue?

I was facing this issue and solved it by configuring my ALB as internal instead of internet-facing (regarding the scheme). Hope it may help someone with the same issue.

Context: The environment is API Gateway + ALB (ECS).

Update: The first ALB I configured was to manage my backend services. Recently I also created another ALB (to handle my front-end instances); in that case, I exposed a public IP (instead of just a private one). This can be achieved by changing the scheme to internet-facing. At first I thought this would bring back the same problem I had before, but it turned out to be something pretty simple: I just needed to add a policy to allow traffic from the internet to the ALB I created.
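
As a rough boto3 sketch of both variants (subnet and security group IDs are placeholders): the scheme is fixed at load balancer creation, and the internet-facing one additionally needs an ingress rule on its security group:

    import boto3

    elbv2 = boto3.client('elbv2')
    ec2 = boto3.client('ec2')

    # Internal ALB for the backend services (only reachable inside the VPC)
    elbv2.create_load_balancer(
        Name='backend-alb',
        Scheme='internal',
        Type='application',
        Subnets=['subnet-aaaa', 'subnet-bbbb'],   # placeholders
        SecurityGroups=['sg-backend'],            # placeholder
    )

    # Internet-facing ALB for the front-end instances
    elbv2.create_load_balancer(
        Name='frontend-alb',
        Scheme='internet-facing',
        Type='application',
        Subnets=['subnet-cccc', 'subnet-dddd'],   # placeholders (public subnets)
        SecurityGroups=['sg-frontend'],           # placeholder
    )

    # The extra "policy": allow inbound traffic from the internet to the ALB
    ec2.authorize_security_group_ingress(
        GroupId='sg-frontend',
        IpPermissions=[{
            'IpProtocol': 'tcp', 'FromPort': 443, 'ToPort': 443,
            'IpRanges': [{'CidrIp': '0.0.0.0/0'}],
        }],
    )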

Though we were never able to really pinpoint the issue, we've come to the conclusion that it was a combination of

  • temporary internal AWS issues causing long delays for HTTP API Gateway to pick up Route 53 zone updates (used for service discovery), and
  • the absence of an Elastic Load Balancer (ELB)

Replacing the API Gateway with CloudFront functionality and introducing an AWS Application Load Balancer changed the method of service discovery: instead of a Route 53 zone, the ELB manages the available ECS/Fargate tasks on its own. Besides a few other minor issues, this resolved the problem for us.
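
For illustration, a hedged boto3 sketch of the ALB-based discovery (names, ARNs, and the health-check path are placeholders): with TargetType='ip', ECS itself registers and deregisters the awsvpc task IPs in the target group, so there is no DNS TTL to go stale.

    import boto3

    elbv2 = boto3.client('elbv2')
    ecs = boto3.client('ecs')

    # Target group for the Fargate task IPs; the ALB health-checks them directly
    tg = elbv2.create_target_group(
        Name='api-tg',                     # placeholder
        Protocol='HTTP',
        Port=8080,
        VpcId='vpc-xxxxxxxx',              # placeholder
        TargetType='ip',                   # required for awsvpc/Fargate tasks
        HealthCheckPath='/api/health',     # placeholder health endpoint
    )
    tg_arn = tg['TargetGroups'][0]['TargetGroupArn']

    # Listener forwarding incoming traffic to the target group
    elbv2.create_listener(
        LoadBalancerArn='arn:aws:elasticloadbalancing:...',   # placeholder ALB ARN
        Protocol='HTTP',
        Port=80,
        DefaultActions=[{'Type': 'forward', 'TargetGroupArn': tg_arn}],
    )

    # The ECS service attaches to the target group instead of a Cloud Map registry
    ecs.create_service(
        cluster='my-cluster',              # placeholder
        serviceName='api-service',
        taskDefinition='api-task:1',       # placeholder
        desiredCount=2,
        launchType='FARGATE',
        networkConfiguration={'awsvpcConfiguration': {
            'subnets': ['subnet-aaaa', 'subnet-bbbb'],   # placeholders
            'securityGroups': ['sg-cccc'],
        }},
        loadBalancers=[{
            'targetGroupArn': tg_arn,
            'containerName': 'api',        # placeholder container name
            'containerPort': 8080,
        }],
    )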

What worked for me was, in addition to configuring my ALB's scheme as internal as xaalves did, also putting the ALB in an isolated or private subnet. Previously I had my ALB in public subnets. bentolor's experience got me thinking that some sort of DNS resolution was going haywire, and sure enough that appeared to be the case. Now 100% of my HTTP calls complete successfully.
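
A small hedged sketch of that placement, treating subnets that do not auto-assign public IPs as private/isolated (the VPC ID is a placeholder; your own subnet tagging may differ):

    import boto3

    ec2 = boto3.client('ec2')
    elbv2 = boto3.client('elbv2')

    # Pick the subnets that do not auto-assign public IPs
    subnets = ec2.describe_subnets(
        Filters=[{'Name': 'vpc-id', 'Values': ['vpc-xxxxxxxx']}]   # placeholder
    )['Subnets']
    private_subnet_ids = [s['SubnetId'] for s in subnets
                          if not s['MapPublicIpOnLaunch']]

    # Internal ALB placed only in the private subnets
    elbv2.create_load_balancer(
        Name='backend-alb',
        Scheme='internal',
        Type='application',
        Subnets=private_subnet_ids,
        SecurityGroups=['sg-backend'],     # placeholder
    )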
