AWS ECS Fargate Memory Utilization vs. Local Docker

Question

We are using AWS Fargate ECS tasks for our Spring WebFlux Java 11 microservice, with FROM gcr.io/distroless/java:11 as the base image. When the application is dockerized locally and run as an image inside a Docker container, memory utilization is quite efficient, and we can see that heap usage never crosses 50%.

However, when we deploy the same image, built from the same Dockerfile, as an ECS task on AWS Fargate, the AWS dashboard shows a completely different picture. Memory utilization never comes down, and the CloudWatch logs show no OutOfMemory errors at all. Once the service was deployed on ECS, we ran a peak load test and a stress test, after which memory utilization reached 94%, and then a 6-hour soak test. Memory utilization was still 94% at the end, without any OOM errors. Garbage collection happens constantly and keeps the application from going OOM, but utilization stays at 94%.

To test the application's memory utilization locally we use VisualVM. We are also trying to connect to the remote ECS task on AWS Fargate using Amazon ECS Exec, but that is still a work in progress.

We have seen the same issue with other microservices, both in our cluster and in other clusters. Once memory utilization reaches its maximum, it never comes down. Kindly help if anyone has faced this issue before.

Edit on 10/10/2022: We connected to the AWS Fargate ECS task using Amazon ECS Exec; the findings are below.

We analysed the GC logs of the AWS ECS Fargate task. The JVM uses the default collector, i.e. Serial GC. We keep getting "Pause Young (Allocation Failure)", which means the young generation keeps running out of room for new allocations and each time triggers a collection.

[2022-10-09T13:33:45.401+0000][1120.447s][info][gc] GC(1417) Pause Full (Allocation Failure) 793M->196M(1093M) 410.170ms
[2022-10-09T13:33:45.403+0000][1120.449s][info][gc] GC(1416) Pause Young (Allocation Failure) 1052M->196M(1067M) 460.286ms
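The pair of messages above, "Pause Young (Allocation Failure)" followed by "Pause Full (Allocation Failure)", is consistent with Serial GC, which, as far as we understand JDK 11 ergonomics, is picked by default when the JVM sees fewer than two CPUs or less than roughly 1.75 GB of memory. A quick way to confirm the collector from inside the task is to ask the JVM through the standard java.lang.management API. The class name WhichGc is hypothetical; under Serial GC we would expect bean names "Copy" and "MarkSweepCompact", under G1 "G1 Young Generation" and "G1 Old Generation".

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.util.ArrayList;
import java.util.List;

// Hypothetical diagnostic: lists the garbage collectors the running JVM selected.
public class WhichGc {

    // Names of the collector MXBeans, e.g. "Copy" + "MarkSweepCompact" (Serial)
    // or "G1 Young Generation" + "G1 Old Generation" (G1).
    static List<String> collectorNames() {
        List<String> names = new ArrayList<>();
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            names.add(gc.getName());
        }
        return names;
    }

    public static void main(String[] args) {
        System.out.println("Collectors: " + collectorNames());
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            // Collection count and accumulated collection time since JVM start.
            System.out.printf("%s: %d collections, %d ms total%n",
                    gc.getName(), gc.getCollectionCount(), gc.getCollectionTime());
        }
        // Ergonomics (including the default GC choice) depend on what the JVM detects here.
        Runtime rt = Runtime.getRuntime();
        System.out.println("Available processors: " + rt.availableProcessors());
        System.out.println("Max heap: " + rt.maxMemory() / (1024 * 1024) + " MB");
    }
}
```

If the task does turn out to run Serial GC, the collector can be chosen explicitly with a flag such as -XX:+UseG1GC; whether that changes what Container Insights reports is something to measure rather than assume.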

We made some code changes to address a byteArray being copied in memory twice. Memory usage did come down, but not by much.

/app # ps -o pid,rss
PID   RSS
    1 1.4g
   16  16m
   30  27m
  515  23m
  524  688
 1655    4
/app # ps -o pid,rss
PID   RSS
    1 1.4g
   16  15m
   30  27m
  515  22m
  524  688
 1710    4
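PID 1 in the ps output is the JVM, and its RSS (1.4g) covers more than the Java heap: metaspace, thread stacks, the code cache and direct buffers all count as well (Reactor Netty, the default WebFlux server, typically allocates direct buffers). To follow RSS and the JVM's own accounting side by side from inside the process, a sampler like the sketch below could be logged periodically. It assumes Linux, where /proc/self/status contains a VmRSS line in kB; the class name RssVsHeap is hypothetical.

```java
import java.io.IOException;
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;
import java.lang.management.MemoryUsage;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;

// Hypothetical sampler: compares the process RSS (as the OS sees it)
// with the JVM's own heap and non-heap accounting. Linux only.
public class RssVsHeap {

    // Returns the VmRSS value in kB from /proc/<pid>/status lines, or -1 if absent.
    static long vmRssKb(List<String> statusLines) {
        for (String line : statusLines) {
            if (line.startsWith("VmRSS:")) {
                // The line looks like "VmRSS:\t 1468004 kB"
                String[] parts = line.substring("VmRSS:".length()).trim().split("\\s+");
                return Long.parseLong(parts[0]);
            }
        }
        return -1;
    }

    static String mb(long bytes) {
        return bytes / (1024 * 1024) + " MB";
    }

    public static void main(String[] args) throws IOException {
        Path status = Paths.get("/proc/self/status");
        long rssKb = Files.isReadable(status) ? vmRssKb(Files.readAllLines(status)) : -1;

        MemoryMXBean mem = ManagementFactory.getMemoryMXBean();
        MemoryUsage heap = mem.getHeapMemoryUsage();
        MemoryUsage nonHeap = mem.getNonHeapMemoryUsage();

        System.out.println("RSS:                " + (rssKb < 0 ? "n/a (not Linux?)" : mb(rssKb * 1024)));
        System.out.println("Heap used:          " + mb(heap.getUsed()));
        System.out.println("Heap committed:     " + mb(heap.getCommitted()));
        System.out.println("Non-heap committed: " + mb(nonHeap.getCommitted()));
    }
}
```

A large, stable gap between RSS and heap committed would point at off-heap memory rather than at the Java heap itself.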

Even after a full GC like the one below, the memory does not come down:

[2022-10-09T13:39:13.460+0000][1448.505s][info][gc] GC(1961) Pause Full (Allocation Failure) 797M->243M(1097M) 502.836ms
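In the -Xlog:gc format, as we read it, the figures are heap in use before the collection, heap in use after it, and in parentheses the heap size the JVM currently has committed. The full GC above frees 554 MB of heap, yet the committed heap stays near 1.1 GB, and memory the JVM has committed and touched stays resident, so it shows up in RSS. The sketch below parses lines of exactly the shape shown in this question so the committed size can be tracked across a long soak test; the class GcLine and its field names are made up for illustration, and G1-style lines with two parenthesized causes would not match the pattern.

```java
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: parses a JDK 11 unified-logging GC pause line such as
// "... GC(1961) Pause Full (Allocation Failure) 797M->243M(1097M) 502.836ms".
public class GcLine {

    // Pause <kind> (<cause>) <before>M-><after>M(<committed>M) <pause>ms
    private static final Pattern PAUSE = Pattern.compile(
            "Pause (\\w+) \\(([^)]*)\\) (\\d+)M->(\\d+)M\\((\\d+)M\\) ([\\d.]+)ms");

    final String kind;      // "Young" or "Full"
    final String cause;     // e.g. "Allocation Failure"
    final long beforeMb;    // heap in use before the collection
    final long afterMb;     // heap in use after the collection
    final long committedMb; // heap currently committed by the JVM
    final double pauseMs;   // stop-the-world pause duration

    private GcLine(String kind, String cause, long beforeMb, long afterMb,
                   long committedMb, double pauseMs) {
        this.kind = kind;
        this.cause = cause;
        this.beforeMb = beforeMb;
        this.afterMb = afterMb;
        this.committedMb = committedMb;
        this.pauseMs = pauseMs;
    }

    static GcLine parse(String line) {
        Matcher m = PAUSE.matcher(line);
        if (!m.find()) {
            throw new IllegalArgumentException("Not a GC pause line: " + line);
        }
        return new GcLine(m.group(1), m.group(2),
                Long.parseLong(m.group(3)), Long.parseLong(m.group(4)),
                Long.parseLong(m.group(5)), Double.parseDouble(m.group(6)));
    }

    long reclaimedMb() {
        return beforeMb - afterMb;
    }

    public static void main(String[] args) {
        GcLine gc = parse("[2022-10-09T13:39:13.460+0000][1448.505s][info][gc] GC(1961) "
                + "Pause Full (Allocation Failure) 797M->243M(1097M) 502.836ms");
        // Locale.ROOT keeps the decimal separator a '.' regardless of the system locale.
        System.out.println(String.format(Locale.ROOT,
                "%s GC (%s): reclaimed %d MB, %d MB still committed, %.1f ms pause",
                gc.kind, gc.cause, gc.reclaimedMb(), gc.committedMb, gc.pauseMs));
        // prints: Full GC (Allocation Failure): reclaimed 554 MB, 1097 MB still committed, 502.8 ms pause
    }
}
```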

One important observation: running a heap inspection triggered a full GC, and even that did not free up the memory. The GC log shows 679M->149M, but neither the ps -o pid,rss output nor the AWS Container Insights graph shows any drop.

[2022-10-09T13:54:50.424+0000][2385.469s][info][gc] GC(1967) Pause Full (Heap Inspection Initiated GC) 679M->149M(1047M) 448.686ms
[2022-10-09T13:56:20.344+0000][2475.390s][info][gc] GC(1968) Pause Full (Heap Inspection Initiated GC) 181M->119M(999M) 448.699ms
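The heap-inspection GCs show the same pattern: used heap drops to 149M and then 119M, while the committed size only moves from 1047M to 999M. To observe this outside the GC log, the sketch below prints heap used, committed and max, plus the direct and mapped buffer pools, before and after an explicit System.gc(). It uses only standard java.lang.management APIs; the class name UsedVsCommitted is hypothetical, System.gc() is only a request, and whether the committed heap shrinks after a full GC depends on the collector and on flags such as -XX:MinHeapFreeRatio and -XX:MaxHeapFreeRatio.

```java
import java.lang.management.BufferPoolMXBean;
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryUsage;

// Hypothetical demo: a GC lowers "used" heap, while "committed" heap
// (what the process holds from the OS) may stay roughly where it was.
public class UsedVsCommitted {

    static String mb(long bytes) {
        // MemoryUsage reports -1 for an undefined maximum.
        return bytes < 0 ? "undefined" : bytes / (1024 * 1024) + " MB";
    }

    static void report(String label) {
        MemoryUsage heap = ManagementFactory.getMemoryMXBean().getHeapMemoryUsage();
        System.out.printf("%-10s heap used=%s committed=%s max=%s%n",
                label, mb(heap.getUsed()), mb(heap.getCommitted()), mb(heap.getMax()));
        // Direct and mapped buffers live outside the Java heap but still add to RSS.
        for (BufferPoolMXBean pool : ManagementFactory.getPlatformMXBeans(BufferPoolMXBean.class)) {
            System.out.printf("%-10s %s buffers=%s%n", label, pool.getName(), mb(pool.getMemoryUsed()));
        }
    }

    public static void main(String[] args) {
        // Hold ~32 MB so the heap has to grow, then drop it so it becomes garbage.
        byte[][] chunks = new byte[32][];
        for (int i = 0; i < chunks.length; i++) {
            chunks[i] = new byte[1024 * 1024];
        }
        report("before GC");
        chunks = null; // drop the only reference to the chunks
        System.gc();   // a request to the JVM, not a guarantee
        report("after GC");
    }
}
```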

Answer 1

How are you running it locally? Do you set any parameters (CPU/memory) for the container you launch? On Fargate there are multiple levels of resource configuration (the size of the task and the amount of resources you assign to the container; check out this blog for more details). Another thing to consider is that, with Fargate, you may land on an instance with far more capacity than the task size you configured. Fargate creates a cgroup that boxes your container(s) to that size, but some older programs (and Java versions) are not cgroup-aware and may assume that the memory available to them is the memory of the whole instance (which you don't see) rather than the configured task size (and cgroup).

I don't have an exact answer (and this did not fit into a comment), but this may be an area worth exploring. Being able to exec into the container should help; ECS Exec is great for that.
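One way to explore this from inside the task (for example through ECS Exec) is to compare what the JVM believes it has with the limit the cgroup actually enforces. Java 11 is container-aware by default (-XX:+UseContainerSupport), and without -Xmx its default maximum heap is 25% of the memory it detects (-XX:MaxRAMPercentage), so the numbers printed below should normally line up with the task size. The sketch assumes the usual cgroup v2 (/sys/fs/cgroup/memory.max) or cgroup v1 (/sys/fs/cgroup/memory/memory.limit_in_bytes) file locations, which may differ on a given platform; the class name JvmVsCgroup is hypothetical.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.OptionalLong;

// Hypothetical check: compares the JVM's view of its resources with the
// memory limit of the cgroup the container runs in.
public class JvmVsCgroup {

    // Typical locations: cgroup v2 first, then cgroup v1 (assumed, not guaranteed).
    private static final String[] LIMIT_FILES = {
        "/sys/fs/cgroup/memory.max",
        "/sys/fs/cgroup/memory/memory.limit_in_bytes"
    };

    // Parses a cgroup limit file's content; "max" (cgroup v2) means no limit.
    static OptionalLong parseLimit(String content) {
        String value = content.trim();
        if (value.isEmpty() || value.equals("max")) {
            return OptionalLong.empty();
        }
        return OptionalLong.of(Long.parseLong(value));
    }

    static OptionalLong readCgroupLimit() throws IOException {
        for (String file : LIMIT_FILES) {
            Path path = Paths.get(file);
            if (Files.isReadable(path)) {
                return parseLimit(new String(Files.readAllBytes(path), StandardCharsets.US_ASCII));
            }
        }
        return OptionalLong.empty(); // no known limit file found
    }

    public static void main(String[] args) throws IOException {
        Runtime rt = Runtime.getRuntime();
        System.out.println("JVM available processors: " + rt.availableProcessors());
        System.out.println("JVM max heap: " + rt.maxMemory() / (1024 * 1024) + " MB");
        OptionalLong limit = readCgroupLimit();
        System.out.println("cgroup memory limit: "
                + (limit.isPresent() ? limit.getAsLong() / (1024 * 1024) + " MB" : "none found"));
    }
}
```

If the max heap printed here is far larger than a quarter of the task memory (and no -Xmx is set), that would support the theory that the JVM is sizing itself from the host rather than from the cgroup.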

Answer 1 · 1 · 2022-10-01 14:27:03
