
Kubernetes AutoScaler or changing Desired Nodes in AWS prematurely terminates Docker Pods

I built a service that uses Docker pods to process data. The time a job takes varies from as little as 15 minutes to as much as 1 hour.

My application captures SIGTERM to ensure a graceful shutdown takes place when demand drops and Pods and Nodes are decommissioned.

In each Docker image I placed code that reports back whether the container shut down because it completed its work or because a SIGTERM was received, in which case it finishes its in-flight processing and then terminates.
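A minimal sketch of that pattern in Python; the work loop and report_shutdown below are stand-ins for my real processing and reporting code:

import signal
import sys
import time

shutdown_requested = False

def handle_sigterm(signum, frame):
    # Kubernetes sends SIGTERM before killing the pod; set a flag so the
    # work loop can finish the current item and report how it exited.
    global shutdown_requested
    shutdown_requested = True

signal.signal(signal.SIGTERM, handle_sigterm)

def report_shutdown(reason):
    # Stand-in for the real reporting call; here it just logs to stdout.
    print(f"shutdown reason: {reason}", flush=True)

def main():
    for chunk in range(100):            # stands in for the real work items
        if shutdown_requested:
            report_shutdown("sigterm")  # graceful stop requested mid-processing
            sys.exit(0)
        time.sleep(1)                   # stands in for minutes of processing
    report_shutdown("completed")        # finished all work normally

if __name__ == "__main__":
    main()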

My system is deployed in AWS using EKS. I use EKS to manage node deployment, scaling nodes up when demand rises and spinning them down when demand drops. I use KEDA to manage Pod deployment, which is what triggers whether additional nodes are needed or not. In KEDA I have the cooldownPeriod set to 2 hours, the maximum I expect a pod to take, even though the most it would ever take is 1 hour.

In AWS EKS, I have also set terminationGracePeriodSeconds to 2 hours.

I isolated the issue to node scale-down: when nodes are being terminated, terminationGracePeriodSeconds is not honored and my Pods are shut down within ~30 minutes. Because the Pods are abruptly removed, I am unable to look at their logs to see what happened.

I tried to simulate this issue by issuing a Kubernetes node drain while keeping my pod running:

kubectl drain <MY NODE>

I saw the SIGTERM come through, and I also noticed that the pod was only terminated after 2 hours and not before.

So for a brief minute I thought maybe I did not configure the terminationGracePeriod properly, so I checked:

kubectl get deployment test-mypod -o yaml|grep terminationGracePeriodSeconds
  terminationGracePeriodSeconds: 7200

I even redeployed the config but that made no difference.

However, I was able to reproduce the issue by modifying the desiredSize of the Node group. I can reproduce it programmatically in Python by doing this:

# self.eks_client is a boto3 EKS client, e.g. boto3.client("eks")
resp = self.eks_client.update_nodegroup_config(clusterName=EKS_CLUSTER_NAME,
                                               nodegroupName=EKS_NODE_GROUP_NAME,
                                               scalingConfig={'desiredSize': configured_desired_size})

or by simply going to the AWS console and modifying the desiredSize there.

I see EKS choosing a node to remove, and if that node happens to be running a pod whose data processing will take about an hour, the pod is sometimes prematurely terminated.

I logged on to the node being scaled down and found no evidence of the prematurely terminated Pod in its logs.

I was able to capture this information once:

kubectl get events | grep test-mypod-b8dfc4665-zp87t
54m         Normal    Pulling    pod/test-mypod-b8dfc4665-zp87t         Pulling image ...
54m         Normal    Pulled     pod/test-mypod-b8dfc4665-zp87t         Successfully pulled image ...
54m         Normal    Created    pod/test-mypod-b8dfc4665-zp87t         Created container mypod
54m         Normal    Started    pod/test-mypod-b8dfc4665-zp87t         Started container mypod
23m         Normal    ScaleDown  pod/test-mypod-b8dfc4665-zp87t         deleting pod for node scale down
23m         Normal    Killing    pod/test-mypod-b8dfc4665-zp87t         Stopping container mypod
13m         Warning   FailedKillPod   pod/test-po-b8dfc4665-zp87t       error killing pod: failed to "KillContainer" for "mypod" with KillContainerError: "rpc error: code = Unknown desc = operation timeout: context deadline exceeded"

I once saw a pod removed for no apparent reason: scale-down was disabled, but the autoscaler still decided to remove my pod:

kubectl get events | grep test-mypod-b8dfc4665-vxqhv
45m         Normal    Pulling    pod/test-mypod-b8dfc4665-vxqhv Pulling image ...
45m         Normal    Pulled     pod/test-mypod-b8dfc4665-vxqhv Successfully pulled image ...
45m         Normal    Created    pod/test-mypod-b8dfc4665-vxqhv Created container mypod
45m         Normal    Started    pod/test-mypod-b8dfc4665-vxqhv Started container mypod
40m         Normal    Killing    pod/test-mypod-b8dfc4665-vxqhv Stopping container mypod

This is the Kubernetes version I have:

Client Version: version.Info{Major:"1", Minor:"18", GitVersion:"v1.18.0", GitCommit:"9e991415386e4cf155a24b1da15becaa390438d8", GitTreeState:"clean", BuildDate:"2020-03-25T14:58:59Z", GoVersion:"go1.13.8", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"18+", GitVersion:"v1.18.20-eks-8c49e2", GitCommit:"8c49e2efc3cfbb7788a58025e679787daed22018", GitTreeState:"clean", BuildDate:"2021-10-17T05:13:46Z", GoVersion:"go1.13.15", Compiler:"gc", Platform:"linux/amd64"}

To minimize this issue, I deploy a Pod Disruption Budget during peak hours to block scale-down, and in the evening during low demand I remove the PDB, which allows scale-down to proceed. However, that is not the right solution, and even during off-peak hours there are still pods that get stopped prematurely.
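For reference, a minimal sketch of that scheduled create/delete with the official Kubernetes Python client; the PDB name, namespace, and label selector below are placeholders rather than my actual values:

from kubernetes import client, config

config.load_kube_config()          # or config.load_incluster_config() when run inside the cluster
policy = client.PolicyV1Api()

PDB_NAME = "block-scaledown-test-mypod"   # placeholder
NAMESPACE = "default"                     # placeholder

def block_scale_down():
    # maxUnavailable: 0 means no matching pod may be evicted, which keeps
    # the Cluster Autoscaler from draining the nodes they run on.
    pdb = client.V1PodDisruptionBudget(
        metadata=client.V1ObjectMeta(name=PDB_NAME),
        spec=client.V1PodDisruptionBudgetSpec(
            max_unavailable=0,
            selector=client.V1LabelSelector(match_labels={"app": "test-mypod"}),
        ),
    )
    policy.create_namespaced_pod_disruption_budget(NAMESPACE, pdb)

def allow_scale_down():
    policy.delete_namespaced_pod_disruption_budget(PDB_NAME, NAMESPACE)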

When using Amazon EKS, the Cluster Autoscaler does not honor terminationGracePeriodSeconds during node scale-down. Per:

https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/FAQ.md#does-ca-respect-gracefultermination-in-scale-down

The Cluster Autoscaler only provides a 10-minute grace period by default. I extracted the relevant text here:

How fast is Cluster Autoscaler?

By default, scale-up is considered up to 10 seconds after pod is marked as unschedulable, and scale-down 10 minutes after a node becomes unneeded. There are multiple flags which can be used to configure these thresholds. For example, in some environments, you may wish to give the k8s scheduler a bit more time to schedule a pod than the CA's scan-interval. One way to do this is by setting --new-pod-scale-up-delay, which causes the CA to ignore unschedulable pods until they are a certain "age", regardless of the scan-interval. If k8s has not scheduled them by the end of that delay, then they may be considered by the CA for a possible scale-up.

Another relevant link: https://github.com/kubernetes/autoscaler/issues/147

I implemented a script to be invoked as a preStop hook, hoping it would block the next step that issues the SIGTERM and starts the 10-minute countdown, giving me a chance to gracefully shut down my service (a sketch of that script appears after the references below). However, the preStop hook does not delay the 10-minute timer.

Some references to that setup:

https://www.ithands-on.com/2021/07/kubernetes-101-pods-lifecycle-hooks_30.html

https://kubernetes.io/docs/tasks/configure-pod-container/attach-handler-lifecycle-event/
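For completeness, here is a minimal sketch of the kind of blocking preStop script I tried; the flag-file path is a placeholder for however your worker signals that it is busy. As noted above, this did not extend the autoscaler's 10-minute limit:

#!/usr/bin/env python3
# Invoked by the container's preStop hook: wait until the worker signals that
# it has finished its current job (here via a flag file), or give up after a
# hard limit that matches terminationGracePeriodSeconds.
import os
import time

BUSY_FLAG = "/tmp/processing-in-progress"   # placeholder path written by the worker
HARD_LIMIT_SECONDS = 2 * 60 * 60

start = time.time()
while os.path.exists(BUSY_FLAG) and time.time() - start < HARD_LIMIT_SECONDS:
    time.sleep(10)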

Instead, I added the following annotation to my pod deployment config, per the following reference:

https://aws.github.io/aws-eks-best-practices/cluster-autoscaling/#prevent-scale-down-eviction

template:
  metadata:
    annotations:
      cluster-autoscaler.kubernetes.io/safe-to-evict: 'false'

Then I ensured that my pods are on-demand pods, i.e. no pods sit idle (idle pods affect EKS scale-down); they are only spawned when needed and shut down when their task is done. This slows my response time for jobs, but that is a smaller price to pay than shutting down a Pod in the middle of an expensive computation.

In case anyone is curious how to deploy the AWS Cluster Autoscaler: https://docs.aws.amazon.com/eks/latest/userguide/autoscaling.html#cluster-autoscaler

It also has a reference on disabling eviction of Pods.

Under load we are still seeing that the safe-to-evict annotation is not being honored, and we reported this back to AWS. With additional debugging I discovered that the nodes hosting the pods are disappearing even though the Cluster Autoscaler is skipping nodes whose pods are marked safe-to-evict. There might be an interoperability issue between EKS and EC2. Until this is resolved I am looking into using Fargate as an alternative.

We faced the same issue with AWS EKS and cluster-autoscaler: nodes were unexpectedly shut down, no preventive actions were working, and even the node annotation cluster-autoscaler.kubernetes.io/scale-down-disabled=true did not make any difference.

After two days of troubleshooting, we found the reason: we use multiple Availability Zones in the ASG configuration, which enables an automatic "AZRebalance" process. AZRebalance tries to keep the number of nodes approximately the same across all Availability Zones, so sometimes when a scale-up event occurs it rebalances nodes by killing one node and creating another in a different Availability Zone. The message in the events log looks like this:

[image: ASG activity log message]

Cluster-autoscaler does not control this process, so there are two systems (cluster-autoscaler and the AWS ASG) that manage the number of nodes simultaneously, which leads to unexpected behavior.

As a workaround, we suspended the "AZRebalance" process in the ASG. [image: the suspended process in the ASG console]
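For reference, the same suspension can be done programmatically with boto3; the Auto Scaling group name here is a placeholder:

import boto3

# Suspend the AZRebalance process so the ASG stops terminating instances
# just to even out the Availability Zones.
asg = boto3.client("autoscaling")
asg.suspend_processes(
    AutoScalingGroupName="my-eks-nodegroup-asg",   # placeholder
    ScalingProcesses=["AZRebalance"],
)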

Another solution would be to use a separate ASG for each Availability Zone and enable the --balance-similar-node-groups feature in the cluster-autoscaler.

Here's the article about that, and here's the cluster-autoscaler documentation.

We worked with Amazon support to solve this issue. The final resolution was not far from @lub0v's answer, but there was still a missing component.

Our EKS cluster had only one node group spanning multiple Availability Zones. Instead, I deployed one node group per Availability Zone. Once we did that, terminationGracePeriodSeconds was honored.
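A rough sketch of what that looks like with boto3; the cluster name, subnet IDs, node role ARN, and sizes below are placeholders:

import boto3

eks = boto3.client("eks")

# One subnet per AZ; each node group is pinned to a single subnet so it never
# spans zones (all values below are placeholders).
SUBNET_PER_AZ = {"us-east-1a": "subnet-aaa", "us-east-1b": "subnet-bbb"}
NODE_ROLE_ARN = "arn:aws:iam::111111111111:role/eks-node-role"

for az, subnet in SUBNET_PER_AZ.items():
    eks.create_nodegroup(
        clusterName="my-cluster",                 # placeholder
        nodegroupName=f"workers-{az}",
        subnets=[subnet],
        nodeRole=NODE_ROLE_ARN,
        scalingConfig={"minSize": 1, "maxSize": 10, "desiredSize": 1},
    )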

Also, don't forget the points from my earlier answers: ensure your pod annotations contain safe-to-evict set to 'false'.

Finally, use --balance-similar-node-groups as a cluster-autoscaler command line parameter if you prefer to have the same number of nodes deployed across zones during upscaling. Currently this parameter is not honored during downscaling.

Reference on autoscaling: https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/FAQ.md
