
AWS ECS service Tasks getting replaced with (reason Request timed out)

We have been running ECS as our container orchestration layer for more than two years, but there is one problem we have not been able to find the cause of. In a few of our (Node.js) services we have started observing errors in ECS events such as:

service example-service (instance i-016b0a460d9974567) (port 1047) is unhealthy in target-group example-service due to (reason Request timed out)

This causes our dependent services to start experiencing 504 Gateway Timeout errors, which impacts them in a big way. Things we have tried so far:

  1. Upgraded the Docker storage driver from devicemapper to overlay2

  2. Increased the resources for all ECS instances, including CPU, RAM, and EBS storage, based on what we saw in a few containers

  3. Increased the health check grace period for the service from 0 to 240 seconds

  4. Increased KeepAliveTimeout and SocketTimeout to 180 seconds (see the sketch after this list)

  5. Enabled awslogs on the containers instead of stdout, but there was no unusual behavior

  6. Enabled ECSMetaData at the container level and pipelined all of that information into our application logs. This helped us search the logs for the problematic container only.

  7. Enabled Container Insights for better container-level debugging
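Regarding item 4: in Node.js the relevant knobs are server.keepAliveTimeout and server.headersTimeout. If keepAliveTimeout is shorter than the load balancer's idle timeout (60 seconds by default on an ALB), the server can close an idle keep-alive socket at the same moment the ALB reuses it, which surfaces as intermittent 504s. A minimal sketch, assuming a plain Node HTTP server (the port and handler are illustrative):

import http from "http";

const server = http.createServer((req, res) => {
  res.writeHead(200, { "Content-Type": "application/json" });
  res.end(JSON.stringify({ status: "ok" }));
});

// Keep idle keep-alive sockets open longer than the ALB idle timeout
// (60 s by default) so the ALB never reuses a connection the server
// is about to close.
server.keepAliveTimeout = 180_000; // ms

// headersTimeout should be strictly greater than keepAliveTimeout,
// otherwise Node can still tear the socket down first.
server.headersTimeout = 185_000; // ms

server.listen(3000);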

Of these, the two things that helped the most were upgrading from the devicemapper to the overlay2 storage driver and increasing the health check grace period.

The number of errors has come down dramatically with these two changes, but we are still hitting the issue once in a while.

We have gone through all the graphs for the instance and the container that went down; below are the logs for it:

ECS Container Insights logs for the victim container:

Query:

fields CpuUtilized, MemoryUtilized, @message
| filter Type = "Container" and EC2InstanceId = "i-016b0a460d9974567" and TaskId = "dac7a872-5536-482f-a2f8-d2234f9db6df"

Example log entry returned:

{
"Version":"0",
"Type":"Container",
"ContainerName":"example-service",
"TaskId":"dac7a872-5536-482f-a2f8-d2234f9db6df",
"TaskDefinitionFamily":"example-service",
"TaskDefinitionRevision":"2048",
"ContainerInstanceId":"74306e00-e32a-4287-a201-72084d3364f6",
"EC2InstanceId":"i-016b0a460d9974567",
"ServiceName":"example-service",
"ClusterName":"example-service-cluster",
"Timestamp":1569227760000,
"CpuUtilized":1024.144923245614,
"CpuReserved":1347.0,
"MemoryUtilized":871,
"MemoryReserved":1857,
"StorageReadBytes":0,
"StorageWriteBytes":577536,
"NetworkRxBytes":14441583,
"NetworkRxDropped":0,
"NetworkRxErrors":0,
"NetworkRxPackets":17324,
"NetworkTxBytes":6136916,
"NetworkTxDropped":0,
"NetworkTxErrors":0,
"NetworkTxPackets":16989
}

None of the logs showed CPU or memory utilization that was unreasonably high.

We stopped getting responses from the victim container at, say, t1; we got errors in the dependent services at t1+2 minutes, and the container was taken away by ECS at t1+3 minutes.

Our health check configuration is below:

Protocol HTTP
Path  /healthcheck
Port traffic port
Healthy threshold  10
Unhealthy threshold 2
Timeout  5
Interval 10
Success codes 200
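
With these settings, a target that stops responding is marked unhealthy after roughly Unhealthy threshold × Interval = 2 × 10 s = 20 s (each individual check giving up after its 5 s timeout), and a recovering target needs Healthy threshold × Interval = 10 × 10 s = 100 s of consecutive passing checks before it receives traffic again.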

Let me know if you need any more information; I will be happy to provide it. The configuration we are running is:

docker info
Containers: 11
 Running: 11
 Paused: 0
 Stopped: 0
Images: 6
Server Version: 18.06.1-ce
Storage Driver: overlay2
 Backing Filesystem: xfs
 Supports d_type: true
 Native Overlay Diff: true
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
 Volume: local
 Network: bridge host macvlan null overlay
 Log: awslogs fluentd gcplogs gelf journald json-file logentries splunk syslog
Swarm: inactive
Runtimes: runc
Default Runtime: runc
Init Binary: docker-init
containerd version: 468a545b9edcd5932818eb9de8e72413e616e86e
runc version: 69663f0bd4b60df09991c08812a60108003fa340
init version: fec3683
Security Options:
 seccomp
  Profile: default
Kernel Version: 4.14.138-89.102.amzn1.x86_64
Operating System: Amazon Linux AMI 2018.03
OSType: linux
Architecture: x86_64
CPUs: 16
Total Memory: 30.41GiB
Name: ip-172-32-6-105
ID: IV65:3LKL:JESM:UFA4:X5RZ:M4NZ:O3BY:IZ2T:UDFW:XCGW:55PW:D7JH
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): false
Registry: https://index.docker.io/v1/
Labels:
Experimental: false
Insecure Registries:
 127.0.0.0/8
Live Restore Enabled: false

There should be some indication of resource contention, a service crash, or a genuine network failure to explain all this, but as mentioned, nothing we found pointed to the cause.

Your steps 1 through 7 have almost nothing to do with the error.

service example-service (instance i-016b0a460d9974567) (port 1047) is unhealthy in target-group example-service due to (reason Request timed out)

The error is very clear: your ECS service is not reachable by the load balancer's health check.

Target Group Unhealthy

When this is the case, go straight to checking the container's security group, port mapping, application status, and health check status code.

Possible reasons:

  • There might be no route for the path /healthcheck in the backend service
  • The status code returned from /healthcheck is not 200
  • The target port might be invalid; configure it correctly: if the application runs on port 8080 or 3000, the target port should be 8080 or 3000 respectively
  • The security group is not allowing traffic to the target group
  • The application is not running in the container

These are the possible reasons for a timeout on the health check. For the first two bullets, a minimal health check route is sketched below.
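
A minimal sketch of such a route, assuming an Express app (the framework, port, and payload are illustrative; the path and success code must match the target group configuration):

import express from "express";

const app = express();

// Path must match the target group's health check path and return
// the configured success code (200 here).
app.get("/healthcheck", (_req, res) => {
  res.status(200).json({ status: "ok" });
});

// Port must match the container port registered with the target group.
app.listen(3000);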

I faced the same issue (reason Request timed out). I managed to solve it by updating my security group inbound rules. There were no rules defined in the inbound rules, so I added a general allow-all IPv4 rule for the time being, because I was in development at that time. A tighter version of the rule is sketched below.
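
For reference, a narrower version of that fix, sketched with the AWS SDK for JavaScript v3 (the security group IDs are placeholders; the port range assumes ECS dynamic port mapping, where instances receive health checks on the ephemeral range, so allowing that range from the load balancer's security group is tighter than allow-all):

import { EC2Client, AuthorizeSecurityGroupIngressCommand } from "@aws-sdk/client-ec2";

const ec2 = new EC2Client({});

async function allowAlbToInstances(): Promise<void> {
  await ec2.send(new AuthorizeSecurityGroupIngressCommand({
    GroupId: "sg-INSTANCES-PLACEHOLDER", // hypothetical: the ECS container instances' SG
    IpPermissions: [{
      IpProtocol: "tcp",
      FromPort: 32768, // default ephemeral range used by ECS dynamic port mapping
      ToPort: 65535,
      UserIdGroupPairs: [{ GroupId: "sg-ALB-PLACEHOLDER" }], // hypothetical: the ALB's SG
    }],
  }));
}

allowAlbToInstances().catch(console.error);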
