AWS ECS Fargate Memory Utilization vs Local Docker

Question

We run our Spring WebFlux Java 11 microservice as AWS ECS tasks on Fargate. The image is built FROM gcr.io/distroless/java:11. When the application is dockerised and run locally as a container, memory utilization is quite efficient and heap usage never crosses 50%.

However, when we deploy the same image, built from the same Dockerfile, to AWS Fargate as an ECS task, the AWS dashboard shows a completely different picture. Memory utilization never comes down, yet the CloudWatch logs show no OutOfMemory errors at all. After deploying to ECS we ran a peak load test and a stress test, after which memory utilization reached 94%, and then a soak test for 6 hours. Memory utilization was still at 94%, without any OOM errors. Garbage collection is happening constantly and keeps the application from going OOM, but utilization stays at 94%.

To test the application's memory utilization locally we use VisualVM. We are also trying to connect to the remote ECS task on Fargate using Amazon ECS Exec, but that is a work in progress.

We have seen the same issue with other microservices in our cluster and in other clusters as well. Once memory utilization reaches a maximum, it never comes down. Kindly help if you have faced the same issue before.

Edit on 10/10/2022: We connected to the AWS Fargate ECS task using Amazon ECS Exec; our findings are below.

We analysed the GC logs of the AWS ECS Fargate task and could see the messages below. It uses the default GC, i.e. Serial GC. We keep getting "Pause Young (Allocation Failure)", which means that an allocation did not fit in the Young Generation and so a collection was triggered; we suspect the memory assigned to the Young Generation is not enough.

[2022-10-09T13:33:45.401+0000][1120.447s][info][gc] GC(1417) Pause Full (Allocation Failure) 793M->196M(1093M) 410.170ms
[2022-10-09T13:33:45.403+0000][1120.449s][info][gc] GC(1416) Pause Young (Allocation Failure) 1052M->196M(1067M) 460.286ms
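
For readers unfamiliar with the format: in each of these lines, figures such as 793M->196M(1093M) are the heap used before the collection, the heap used after it, and the current heap capacity in parentheses, followed by the pause time. Below is a minimal sketch for pulling those numbers out of such lines. It assumes the JDK 11 unified logging format shown above with sizes reported in M; GcLogLine is a hypothetical helper name, not a JDK class.

```java
import java.util.Locale;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of the JDK or of the application in question.
final class GcLogLine {
    // Matches e.g. "Pause Full (Allocation Failure) 793M->196M(1093M) 410.170ms".
    // Assumption: sizes are logged in M, as in the lines quoted in this post.
    private static final Pattern PAUSE = Pattern.compile(
            "Pause (Young|Full) \\((.+?)\\) (\\d+)M->(\\d+)M\\((\\d+)M\\) ([0-9.]+)ms");

    final String kind;        // "Young" or "Full"
    final String cause;       // e.g. "Allocation Failure"
    final long usedBeforeMb;  // heap used before the collection
    final long usedAfterMb;   // heap used after the collection
    final long capacityMb;    // heap capacity reported in parentheses
    final double pauseMs;     // pause duration

    private GcLogLine(String kind, String cause, long usedBeforeMb,
                      long usedAfterMb, long capacityMb, double pauseMs) {
        this.kind = kind;
        this.cause = cause;
        this.usedBeforeMb = usedBeforeMb;
        this.usedAfterMb = usedAfterMb;
        this.capacityMb = capacityMb;
        this.pauseMs = pauseMs;
    }

    // Returns an empty Optional when the line is not a GC pause line.
    static Optional<GcLogLine> parse(String line) {
        Matcher m = PAUSE.matcher(line);
        if (!m.find()) {
            return Optional.empty();
        }
        return Optional.of(new GcLogLine(
                m.group(1), m.group(2),
                Long.parseLong(m.group(3)), Long.parseLong(m.group(4)),
                Long.parseLong(m.group(5)), Double.parseDouble(m.group(6))));
    }

    public static void main(String[] args) {
        String line = "[2022-10-09T13:33:45.401+0000][1120.447s][info][gc] GC(1417) "
                + "Pause Full (Allocation Failure) 793M->196M(1093M) 410.170ms";
        parse(line).ifPresent(g -> System.out.printf(Locale.ROOT,
                "%s GC (%s): used %dM -> %dM, reclaimed %dM, capacity %dM, pause %.1f ms%n",
                g.kind, g.cause, g.usedBeforeMb, g.usedAfterMb,
                g.usedBeforeMb - g.usedAfterMb, g.capacityMb, g.pauseMs));
    }
}
```

Read this way, each collection reclaims several hundred megabytes of used heap, while the capacity in parentheses stays around 1 GB.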

We made some code changes to stop a byte array from being copied in memory twice, and the memory did come down, but not by much.

/app # ps -o pid,rss
PID   RSS
    1 1.4g
   16  16m
   30  27m
  515  23m
  524  688
 1655    4
/app # ps -o pid,rss
PID   RSS
    1 1.4g
   16  15m
   30  27m
  515  22m
  524  688
 1710    4

Even after a full GC like the one below, the memory does not come down:

[2022-10-09T13:39:13.460+0000][1448.505s][info][gc] GC(1961) Pause Full (Allocation Failure) 797M->243M(1097M) 502.836ms

One important observation was that running a heap inspection triggered a full GC, and even that did not clear the memory. The log shows 679M->149M, but neither the ps -o pid,rss output nor the AWS Container Insights graph shows the drop.

[2022-10-09T13:54:50.424+0000][2385.469s][info][gc] GC(1967) Pause Full (Heap Inspection Initiated GC) 679M->149M(1047M) 448.686ms
[2022-10-09T13:56:20.344+0000][2475.390s][info][gc] GC(1968) Pause Full (Heap Inspection Initiated GC) 181M->119M(999M) 448.699ms
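
One possible reading of the mismatch, offered as an assumption rather than a confirmed diagnosis: the GC log reports used heap, whereas ps and Container Insights report memory held by the whole process, which is closer to the committed heap plus non-heap memory (metaspace, thread stacks, direct buffers). A minimal sketch for comparing used and committed sizes from inside the JVM, using the standard java.lang.management API; the class name HeapVsCommitted is hypothetical:

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;
import java.lang.management.MemoryUsage;
import java.util.Locale;

// Hypothetical diagnostic class; run it (or call describe) inside the service's JVM.
final class HeapVsCommitted {
    static long toMb(long bytes) {
        return bytes / (1024L * 1024L);
    }

    // Formats one memory area; MemoryUsage.getMax() returns -1 when no maximum is defined.
    static String describe(String label, MemoryUsage u) {
        String max = u.getMax() < 0 ? "undefined" : toMb(u.getMax()) + "M";
        return String.format(Locale.ROOT, "%s: used=%dM committed=%dM max=%s",
                label, toMb(u.getUsed()), toMb(u.getCommitted()), max);
    }

    public static void main(String[] args) {
        MemoryMXBean mem = ManagementFactory.getMemoryMXBean();
        // "used" drops after a GC; "committed" is what the JVM still holds from the OS.
        System.out.println(describe("heap", mem.getHeapMemoryUsage()));
        System.out.println(describe("non-heap", mem.getNonHeapMemoryUsage()));
    }
}
```

If committed stays near 1 GB while used falls to 149M, the flat RSS would be consistent with the JVM keeping freed heap committed rather than with a leak; that would still need to be verified on the running task.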

Answer 1

How are you running it locally? Do you set any parameters (CPU/memory) for the container you launch? On Fargate there are multiple levels of resource configuration: the size of the task and the amount of resources you assign to the container (check out this blog for more details). The other thing to consider is that, with Fargate, you may land on an instance with much more capacity than the task size you configured. Fargate will create a cgroup that boxes your container(s) to that size, but some old programs (and Java versions) are not cgroup-aware, and they may assume that the memory available is that of the instance (which you don't see) rather than the task size (and cgroup) that was configured.
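
One way to test this is to print what the JVM itself believes its limits are, once locally and once on Fargate, and compare. A minimal sketch using only java.lang.Runtime; the class name JvmLimits is hypothetical. As far as I know, when no -Xmx is set the JVM sizes its maximum heap as a fraction of the memory it detects, so a maximum heap far above the task size would suggest the cgroup limit is not being seen.

```java
import java.util.Locale;

// Hypothetical diagnostic class: reports the limits the running JVM has detected.
final class JvmLimits {
    static String report() {
        Runtime rt = Runtime.getRuntime();
        // maxMemory() is the most heap the JVM will attempt to use (roughly -Xmx).
        long maxHeapMb = rt.maxMemory() / (1024L * 1024L);
        return String.format(Locale.ROOT, "maxHeap=%dM processors=%d",
                maxHeapMb, rt.availableProcessors());
    }

    public static void main(String[] args) {
        System.out.println("java.version=" + System.getProperty("java.version"));
        System.out.println(report());
    }
}
```

Running the same class in the local container and via ECS Exec in the Fargate task gives two directly comparable lines.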

I don't have an exact answer (and this did not fit into a comment), but this may be an area you can explore. Being able to exec into the container should help; ECS Exec is great for that.

solution1
1 2022-10-01 14:27:03