
Kubernetes AutoScaler or changing Desired Nodes in AWS prematurely terminates Docker Pods

I built a service that uses Docker containers running as Kubernetes Pods to process data. Processing time varies from as little as 15 minutes to as much as 1 hour.

My application captures SIGTERM to ensure a graceful shutdown when demand drops and Pods and Nodes are decommissioned.

In each Docker image I placed code that reports back whether the container shut down because it completed its work, or because it received a SIGTERM, finished its in-flight processing, and then terminated.

My system is deployed in AWS using EKS. EKS adds nodes when demand goes up and spins nodes down when demand drops. I use KEDA to manage Pod deployment, which in turn drives whether additional nodes are needed. In KEDA I set the cooldownPeriod to 2 hours, the maximum I would ever expect a Pod to take, even though the longest a job ever takes is about 1 hour.
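For reference, here is a minimal sketch of that KEDA configuration. The ScaledObject name and the trigger are hypothetical (the question does not say what metric drives scaling); only cooldownPeriod and the target Deployment name reflect the setup described above:

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: test-mypod-scaler                 # hypothetical name
spec:
  scaleTargetRef:
    name: test-mypod                      # the Deployment running the workers
  cooldownPeriod: 7200                    # 2 hours, well above the longest expected job (~1 hour)
  minReplicaCount: 0                      # scale to zero so no Pods sit idle (see below)
  maxReplicaCount: 10                     # hypothetical upper bound
  triggers:
    - type: aws-sqs-queue                 # hypothetical trigger type
      metadata:
        queueURL: https://sqs.us-east-1.amazonaws.com/123456789012/jobs   # placeholder
        queueLength: "5"
        awsRegion: us-east-1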

In the Pod spec deployed to EKS, I have set terminationGracePeriodSeconds to 2 hours as well.
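For context, this is roughly where that value sits in the Deployment's Pod template (a minimal sketch; the labels and image are placeholders):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: test-mypod
spec:
  selector:
    matchLabels:
      app: test-mypod
  template:
    metadata:
      labels:
        app: test-mypod
    spec:
      terminationGracePeriodSeconds: 7200   # 2 hours, matching the KEDA cooldownPeriod
      containers:
        - name: mypod
          image: my-image:latest            # placeholder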

I isolated the issue to node scale-down: when nodes are being terminated, terminationGracePeriodSeconds is not honored and my Pods are shut down within ~30 minutes. Because the Pods are abruptly removed, I am unable to look at their logs to see what happened.

I tried to simulate this issue by issuing a Kubernetes node drain while keeping my pod running:

kubectl drain <MY NODE>

I saw the SIGTERM come through, and I also noticed that the pod was only terminated after 2 hours and not before.

For a brief minute I thought maybe I had not configured terminationGracePeriodSeconds properly, so I checked:

kubectl get deployment test-mypod -o yaml|grep terminationGracePeriodSeconds
  terminationGracePeriodSeconds: 7200

I even redeployed the config but that made no difference.

However, I was able to reproduce the issue by modifying the desiredSize of the Node group. I can reproduce it programmatically in Python by doing this:

        resp = self.eks_client.update_nodegroup_config(clusterName=EKS_CLUSTER_NAME,
                                                       nodegroupName=EKS_NODE_GROUP_NAME,
                                                       scalingConfig={'desiredSize': configured_desired_size})

or by simply going to the AWS console and modifying the desiredSize there.

I then see EKS choose a node to remove, and if that node happens to be running a pod whose processing takes about an hour, the pod is sometimes prematurely terminated.

I logged on to the node being scaled down and found no evidence of the prematurely terminated Pod in its logs.

I was able to capture this information once:

kubectl get events | grep test-mypod-b8dfc4665-zp87t
54m         Normal    Pulling    pod/test-mypod-b8dfc4665-zp87t         Pulling image ...
54m         Normal    Pulled     pod/test-mypod-b8dfc4665-zp87t         Successfully pulled image ...
54m         Normal    Created    pod/test-mypod-b8dfc4665-zp87t         Created container mypod
54m         Normal    Started    pod/test-mypod-b8dfc4665-zp87t         Started container mypod
23m         Normal    ScaleDown  pod/test-mypod-b8dfc4665-zp87t         deleting pod for node scale down
23m         Normal    Killing    pod/test-mypod-b8dfc4665-zp87t         Stopping container mypod
13m         Warning   FailedKillPod   pod/test-mypod-b8dfc4665-zp87t    error killing pod: failed to "KillContainer" for "mypod" with KillContainerError: "rpc error: code = Unknown desc = operation timeout: context deadline exceeded"

I also once saw a pod removed for no apparent reason: scale-down was disabled, yet it still decided to remove my pod:

kubectl get events | grep test-mypod-b8dfc4665-vxqhv
45m         Normal    Pulling    pod/test-mypod-b8dfc4665-vxqhv Pulling image ...
45m         Normal    Pulled     pod/test-mypod-b8dfc4665-vxqhv Successfully pulled image ...
45m         Normal    Created    pod/test-mypod-b8dfc4665-vxqhv Created container mypod
45m         Normal    Started    pod/test-mypod-b8dfc4665-vxqhv Started container mypod
40m         Normal    Killing    pod/test-mypod-b8dfc4665-vxqhv Stopping container mypod

This is the Kubernetes version I have:

Client Version: version.Info{Major:"1", Minor:"18", GitVersion:"v1.18.0", GitCommit:"9e991415386e4cf155a24b1da15becaa390438d8", GitTreeState:"clean", BuildDate:"2020-03-25T14:58:59Z", GoVersion:"go1.13.8", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"18+", GitVersion:"v1.18.20-eks-8c49e2", GitCommit:"8c49e2efc3cfbb7788a58025e679787daed22018", GitTreeState:"clean", BuildDate:"2021-10-17T05:13:46Z", GoVersion:"go1.13.15", Compiler:"gc", Platform:"linux/amd64"}

To minimize this issue, I deploy a PodDisruptionBudget during peak hours to block scale-down, and in the evening during low demand I remove the PDB, which allows scale-down to proceed; a sketch of that PDB is shown below. However, that is not the right solution, and even during off-peak hours there are still pods that get stopped prematurely.
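For reference, a minimal sketch of such a PodDisruptionBudget, assuming the Pods carry an app: test-mypod label (the name and label are placeholders). The Cluster Autoscaler will not drain a node if evicting a pod would violate a PDB, which is why this blocks scale-down while it is in place:

apiVersion: policy/v1beta1                  # policy/v1 on Kubernetes 1.21+
kind: PodDisruptionBudget
metadata:
  name: test-mypod-pdb                      # hypothetical name
spec:
  maxUnavailable: 0                         # block all voluntary evictions, including autoscaler drains
  selector:
    matchLabels:
      app: test-mypod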

When using Amazon EKS, the Cluster Autoscaler does not honor a terminationGracePeriodSeconds longer than its own limit. Per

https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/FAQ.md#does-ca-respect-gracefultermination-in-scale-down

The Cluster Autoscaler only provides a 10-minute grace period during scale-down. I extracted the relevant text here:

How fast is Cluster Autoscaler?

By default, scale-up is considered up to 10 seconds after pod is marked as unschedulable, and scale-down 10 minutes after a node becomes unneeded. There are multiple flags which can be used to configure these thresholds. For example, in some environments, you may wish to give the k8s scheduler a bit more time to schedule a pod than the CA's scan-interval. One way to do this is by setting --new-pod-scale-up-delay, which causes the CA to ignore unschedulable pods until they are a certain "age", regardless of the scan-interval. If k8s has not scheduled them by the end of that delay, then they may be considered by the CA for a possible scale-up.

Another relevant link: https://github.com/kubernetes/autoscaler/issues/147

I implemented a script to be invoked as a preStop hook, hoping it would delay the step that issues the SIGTERM and starts the 10-minute countdown, giving me a chance to gracefully shut down my service. However, the preStop hook does not delay the 10-minute timer; a sketch of that setup is below.
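For reference, this is roughly how the hook was wired into the container spec (a minimal sketch; the script path is a placeholder for my actual shutdown script):

    spec:
      terminationGracePeriodSeconds: 7200
      containers:
        - name: mypod
          image: my-image:latest                          # placeholder
          lifecycle:
            preStop:
              exec:
                # hypothetical script; it runs before the container receives SIGTERM,
                # but it does not extend the autoscaler's 10-minute limit
                command: ["/bin/sh", "-c", "/app/wait-for-job.sh"]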

Some references to that setup:

https://www.ithands-on.com/2021/07/kubernetes-101-pods-lifecycle-hooks_30.html

https://kubernetes.io/docs/tasks/configure-pod-container/attach-handler-lifecycle-event/

Instead, I added the following annotation to my pod deployment config, per the following reference:

https://aws.github.io/aws-eks-best-practices/cluster-autoscaling/#prevent-scale-down-eviction

template:
  metadata:
    annotations:
      cluster-autoscaler.kubernetes.io/safe-to-evict: 'false'

Then I ensured that my pods are created on demand only: no pods sit idle (idle pods interfere with EKS scale-down); they are spawned when needed and shut down when their task is done. This slows my response time for jobs, but that is a small price to pay compared to shutting down a Pod in the middle of an expensive computation.

In case anyone is curious about how to deploy the AWS Cluster Autoscaler: https://docs.aws.amazon.com/eks/latest/userguide/autoscaling.html#cluster-autoscaler

It also includes a reference on disabling eviction of Pods.

Under load we still see that the safe-to-evict annotation is not being honored, and we have reported this back to AWS. With additional debugging I discovered that nodes hosting the pods are disappearing even though the Cluster Autoscaler is supposed to skip nodes whose pods are marked safe-to-evict: 'false'. There might be an interoperability issue between EKS and EC2. Until this is resolved I am looking into Fargate as an alternative.

We faced the same issue with AWS EKS and cluster-autoscaler - nodes were unexpectedly shut down, no preventive actions were working, and even the node annotation cluster-autoscaler.kubernetes.io/scale-down-disabled=true did not make any difference.

After two days of troubleshooting, we found the reason - we use multiple Availability Zones in the ASG configuration, which has an automatic "AZRebalance" process. AZRebalance tries to ensure that the number of nodes is approximately the same across all availability zones. Therefore, sometimes when a scale-up event occurs, it tries to rebalance nodes by killing one node and creating another in a different availability zone. The message in the events log looks like this:

[screenshot of the ASG event-log message]

Cluster-autoscaler does not control this process, so there are two systems (cluster-autoscaler and AWS ASG) that manage the number of nodes simultaneously, which leads to unexpected behavior.

As a workaround, we suspended the "AZRebalance" process in the ASG. [screenshot of the suspended-processes setting in the ASG console]

Another solution would be to use a separate ASG for each availability zone and use the --balance-similar-node-groups feature in the cluster-autoscaler; a sketch of that flag is shown below.
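A minimal sketch of how that flag is typically passed to the cluster-autoscaler container (the image tag and the other arguments follow the usual upstream/AWS example manifests and may differ in your setup):

    spec:
      containers:
        - name: cluster-autoscaler
          image: k8s.gcr.io/autoscaling/cluster-autoscaler:v1.18.3   # use a tag matching your cluster's minor version
          command:
            - ./cluster-autoscaler
            - --cloud-provider=aws
            - --balance-similar-node-groups         # keep the per-AZ node groups at roughly equal size
            - --node-group-auto-discovery=asg:tag=k8s.io/cluster-autoscaler/enabled,k8s.io/cluster-autoscaler/<MY CLUSTER NAME>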

Here's the article about that and here's the cluster-autoscaler documentation.

We worked with Amazon support to solve this issue. The final resolution was not far from @lub0v's answer, but there was still a missing component.

Our EKS cluster had only one node group that spanned multiple Availability Zones. Instead, I deployed one node group per Availability Zone (one possible layout is sketched below). Once we did that, terminationGracePeriodSeconds was honored.
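One way to express that layout, assuming the node groups are managed with eksctl (the question does not say how they were created; cluster name, region, and sizes are placeholders):

apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: my-cluster                          # placeholder
  region: us-east-1                         # placeholder
managedNodeGroups:
  - name: workers-us-east-1a
    availabilityZones: ["us-east-1a"]       # one node group (and hence one ASG) per AZ
    minSize: 1
    maxSize: 10
  - name: workers-us-east-1b
    availabilityZones: ["us-east-1b"]
    minSize: 1
    maxSize: 10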

Also, don't forget the prior answers I added earlier: ensure your pod annotations contain safe-to-evict set to 'false'.

Finally, use --balance-similar-node-groups as a cluster-autoscaler command-line parameter if you prefer to have the same number of nodes deployed during upscaling. Currently this parameter is not honored during downscaling.

Reference on autoscaling: https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/FAQ.md
