
AWS ECS 503 Service Temporarily Unavailable while deploying

I am using Amazon Web Services EC2 Container Service with an Application Load Balancer for my app. When I deploy a new version, I get 503 Service Temporarily Unavailable for about 2 minutes. That is a bit more than the startup time of my application. This means that I cannot do a zero-downtime deployment now.

Is there a setting to not use the new tasks while they are starting up? Or what am I missing here?

UPDATE:

The health check numbers for the target group of the ALB are the following:

Healthy threshold:     5
Unhealthy threshold:   2
Timeout:               5 seconds
Interval:              30 seconds
Success codes:         200 OK

Healthy threshold is 'The number of consecutive health checks successes required before considering an unhealthy target healthy.'
Unhealthy threshold is 'The number of consecutive health check failures required before considering a target unhealthy.'
Timeout is 'The amount of time, in seconds, during which no response means a failed health check.'
Interval is 'The approximate amount of time between health checks of an individual target.'

UPDATE 2: So, my cluster consists of two EC2 instances, but can scale up if needed. The desired and minimum count is 2. I run one task per instance, because my app needs a specific port number. Before I deploy (Jenkins runs an AWS CLI script), I set the number of instances to 4. Without this, AWS cannot deploy my new tasks (this is another issue to solve). Networking mode is bridge.

So, the issue seems to lie in the port mappings of my container settings in the task definition. Before, I was using 80 as the host port and 8080 as the container port. I thought I needed to use these, but the host port can actually be any value. If you set it to 0, ECS will assign a port in the range 32768-61000, which makes it possible to add multiple tasks to one instance. For this to work, I also needed to change my security group to let traffic from the ALB reach the instances on these ports.
So, when ECS can run multiple tasks on the same instance, the 50/200 min/max healthy percent makes sense, and it is possible to deploy a new task revision without adding new instances. This also ensures a zero-downtime deployment.
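
A minimal sketch of that change (not my exact setup; the task family, image URI and security group IDs below are placeholders): register the task definition with hostPort 0 so ECS picks an ephemeral host port per task, then open the ephemeral range from the ALB's security group to the instances.

cat > taskdef.json <<'EOF'
{
  "family": "myapp",
  "networkMode": "bridge",
  "containerDefinitions": [
    {
      "name": "myapp",
      "image": "123456789012.dkr.ecr.us-east-1.amazonaws.com/myapp:latest",
      "memory": 512,
      "essential": true,
      "portMappings": [
        { "containerPort": 8080, "hostPort": 0, "protocol": "tcp" }
      ]
    }
  ]
}
EOF
aws ecs register-task-definition --cli-input-json file://taskdef.json

# Allow the ALB's security group to reach the ephemeral port range on the instances
aws ec2 authorize-security-group-ingress \
  --group-id sg-0instance0example0 \
  --protocol tcp \
  --port 32768-61000 \
  --source-group sg-0alb000000example0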

Thank you to everybody who asked or commented!

Since you are using AWS ECS, may I ask what the service's "minimum healthy percent" and "maximum percent" are?

Make sure that you have a "maximum percent" of 200 and a "minimum healthy percent" of 50, so that during deployment not all of your service's tasks go down.

Please find the documentation definitions of these two terms:

Maximum percent provides an upper limit on the number of running tasks during a deployment, enabling you to define the deployment batch size.

Minimum healthy percent provides a lower limit on the number of running tasks during a deployment, enabling you to deploy without using additional cluster capacity.

A "minimum healthy percent" of 50 makes sure that only half of the service's containers are killed before the new version of the container is deployed. For example, if the desired task count of the service is 2, then at deployment time only 1 container running the old version is killed first; once the new version is running, the second old container is killed and another new-version container is started. This makes sure that at any given time there are tasks handling requests.

Similarly, a "maximum percent" of 200 tells ECS that at any given time during the deployment the service's running task count can go up to at most double the desired count.
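
For reference, a rough sketch of setting these values with the AWS CLI (the cluster and service names are placeholders, not taken from the question):

aws ecs update-service \
  --cluster my-cluster \
  --service my-service \
  --deployment-configuration "maximumPercent=200,minimumHealthyPercent=50"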

Please let me know in case of any further questions.

With your settings, your application startup would have to take more than 30 seconds in order to fail 2 health checks and be marked unhealthy (assuming the first check happens immediately after your app went down). And it will then take at least 2 minutes and up to 3 minutes to be marked healthy again (first check immediately after your app came back online in the best case, or first check immediately before your app came back up in the worst case).

So, a quick and dirty fix is to increase the Unhealthy threshold so that targets won't be marked unhealthy during updates, and maybe decrease the Healthy threshold so that they are marked healthy again more quickly.
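
A sketch of that adjustment with the AWS CLI (the target group ARN is a placeholder, and the values only illustrate the direction of the change):

aws elbv2 modify-target-group \
  --target-group-arn arn:aws:elasticloadbalancing:us-east-1:123456789012:targetgroup/my-targets/0123456789abcdef \
  --healthy-threshold-count 2 \
  --unhealthy-threshold-count 5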

But if you really want to achieve zero downtime, then you should use multiple instances of your app and tell AWS to stage deployments as suggested by Manish Joshi (so that there are always enough healthy instances behind your ELB to keep your site operational).

The way I solved this was to have a flat file in the application root that the ALB would monitor to remain healthy. Before deployment, a script removes this file while monitoring the node until it registers OutOfService.

That way, all live connections would have stopped and drained. At this point, the deployment is started by stopping the node or application process. After deployment, the node is added back to the LB by restoring the flat file, and it is monitored until it registers InService, before moving on to the second node to repeat the same steps.

My script looks as follows:

# Remove Health Check target
# ($USER, $MYAPP, $ELB, $REGION, $AMAZONID and $INTERVAL are assumed to be set by the surrounding deploy job)
echo -e "\nDisabling the ELB Health Check target and waiting for OutOfService\n"
rm -f /home/$USER/$MYAPP/server/public/alive.html

# Loop until the Instance is Out Of Service
while true
do
        RESULT=$(aws elb describe-instance-health --load-balancer-name $ELB --region $REGION --instances $AMAZONID)
        if echo $RESULT | grep -qi OutOfService ; then
                echo "Instance is Deattached"
                break
        fi
        echo -n ". "
        sleep $INTERVAL
done
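
The second half of the flow (not shown above; a sketch that mirrors the same variables) restores the file and waits until the node registers InService again before moving on to the next node:

# Restore the Health Check target and wait for InService
echo -e "\nRe-enabling the ELB Health Check target and waiting for InService\n"
touch /home/$USER/$MYAPP/server/public/alive.html

# Loop until the Instance is In Service
while true
do
        RESULT=$(aws elb describe-instance-health --load-balancer-name $ELB --region $REGION --instances $AMAZONID)
        if echo $RESULT | grep -qi InService ; then
                echo "Instance is Attached"
                break
        fi
        echo -n ". "
        sleep $INTERVAL
done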

You were speaking about Jenkins, so I'll answer with the Jenkins master service in mind, but my answer remains valid for any other case (even if it's not a good example for ECS: a Jenkins master doesn't scale correctly, so there can be only one instance).

503 bad gateway

I often encountered 503 gateway errors related to the load balancer failing healthchecks (no healthy instance). Have a look at your load balancer monitoring tab to ensure that the count of healthy hosts is always above 0.

If you're doing an HTTP healthcheck, it must return a code 200 (the list of valid codes is configurable in the load balancer settings) only when your server is really up and running. Otherwise the load balancer could put instances into service that are not fully running yet.

If the issue is that you always get a 503 bad gateway, it may be because your instances take too long to answer (while the service is initializing), so ECS considers them as down and closes them before their initialization is complete. That's often the case on a Jenkins first run.

To avoid that last problem, you can consider adapting your load balancer ping target (healthcheck target for a classic load balancer, listener for an application load balancer):

  • With an application load balancer, try with something that will always return 200 (for Jenkins it may be a public file like /robots.txt for example).
  • With a classic load balancer, use a TCP port test rather than an HTTP test. It will always succeed if you have opened the port correctly (see the sketch below).
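
For the classic load balancer case, a sketch of switching the health check to a plain TCP test with the AWS CLI (the load balancer name and port are placeholders):

aws elb configure-health-check \
  --load-balancer-name my-classic-elb \
  --health-check Target=TCP:8080,Interval=30,Timeout=5,UnhealthyThreshold=2,HealthyThreshold=2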

One node per instance

If you need to be sure you have only one node per instance, you may use a classic load balancer (it also behaves well with ECS). With classic load balancers, ECS ensures that only one task of the service runs per server. That's also the only solution to have non-HTTP ports accessible (for instance, Jenkins needs 80, but also 50000 for the slaves).

However, as the ports are not dynamic with a classic load balancer, you have to do some port mapping, for example:

myloadbalancer.mydomain.com:80 (port 80 of the load balancer) -> instance:8081 (external port of your container) -> service:80 (internal port of your container).
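
As a sketch of that mapping with the AWS CLI (the load balancer name, subnet and ports are illustrative), the classic load balancer's listener forwards its port 80 to the instance port exposed by the container, which in turn maps to the container's internal port in the task definition:

aws elb create-load-balancer \
  --load-balancer-name myloadbalancer \
  --listeners "Protocol=HTTP,LoadBalancerPort=80,InstanceProtocol=HTTP,InstancePort=8081" \
  --subnets subnet-0123456789abcdef0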

And of course you need one load balancer per service.

Jenkins healthcheck

If it's really a Jenkins service that you want to launch, you should use the Jenkins Metrics plugin to obtain a good healthcheck URL.

Install it, and in the global options generate a token and activate the ping; you should then be able to reach a URL looking like this: http://myjenkins.domain.com/metrics/mytoken12b3ad1/ping

This URL will answer with HTTP code 200 only when the server is fully running, which is important so that the load balancer only activates it when it's completely ready.
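
A quick way to verify that behaviour (using the example hostname and token above):

# Should print 200 once Jenkins is fully up, and a non-200 code while it is still starting
curl -s -o /dev/null -w "%{http_code}\n" http://myjenkins.domain.com/metrics/mytoken12b3ad1/ping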

Logs

Finally, if you want to know what is happening to your instance and why it is failing, you can add logs to see what the container is saying in AWS CloudWatch.

Just add this in the task definition (container conf):

Log configuration: awslogs
awslogs-group: mycompany (the CloudWatch group that will collect your container logs)
awslogs-region: us-east-1 (your cluster region)
awslogs-stream-prefix: myservice (a prefix to create the log name)
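
Expressed in the container definition JSON, it would look roughly like this (a sketch; the log group is assumed not to exist yet, so it is created first):

# Create the log group once
aws logs create-log-group --log-group-name mycompany --region us-east-1

# Fragment of the container definition (same values as above)
cat > log-config.json <<'EOF'
{
  "logConfiguration": {
    "logDriver": "awslogs",
    "options": {
      "awslogs-group": "mycompany",
      "awslogs-region": "us-east-1",
      "awslogs-stream-prefix": "myservice"
    }
  }
}
EOF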

It will give you more insight into what is happening during container initialization, whether it just takes too long or whether it is failing.

Hope it helps!
