
How do I resolve PodEvictionFailure error in AWS EKS?

I am trying to upgrade my node group in AWS EKS. I am using CDK, and I am getting the following error:

Resource handler returned message: "[ErrorDetail(ErrorCode=PodEvictionFailure, ErrorMessage=Reached max retries while trying to evict pods from nodes in node group <node-group-name>, ResourceIds=[<node-name>])] (Service: null, Status Code: 0, Request ID: null)" (RequestToken: <request-token>, HandlerErrorCode: GeneralServiceException)

According to the AWS docs, PodEvictionFailure can occur if a deployment tolerates every taint, because the node can then never become empty.

https://docs.aws.amazon.com/eks/latest/userguide/managed-node-update-behavior.html#managed-node-update-upgrade

Deployment tolerating all the taints – Once every pod is evicted, it's expected for the node to be empty because the node is tainted in the earlier steps. However, if the deployment tolerates every taint, then the node is more likely to be non-empty, leading to pod eviction failure.

I checked my nodes and all the pods running on them, and found the following pods that tolerate every taint.

Both of the following pods have the tolerations below.

  • Pod: kube-system/aws-node-pdmbh
  • Pod: kube-system/kube-proxy-7n2kf
{
  ...
  ...

  "tolerations": [
    {
      "operator": "Exists"
    },
    {
      "key": "node.kubernetes.io/not-ready",
      "operator": "Exists",
      "effect": "NoExecute"
    },
    {
      "key": "node.kubernetes.io/unreachable",
      "operator": "Exists",
      "effect": "NoExecute"
    },
    {
      "key": "node.kubernetes.io/disk-pressure",
      "operator": "Exists",
      "effect": "NoSchedule"
    },
    {
      "key": "node.kubernetes.io/memory-pressure",
      "operator": "Exists",
      "effect": "NoSchedule"
    },
    {
      "key": "node.kubernetes.io/pid-pressure",
      "operator": "Exists",
      "effect": "NoSchedule"
    },
    {
      "key": "node.kubernetes.io/unschedulable",
      "operator": "Exists",
      "effect": "NoSchedule"
    },
    {
      "key": "node.kubernetes.io/network-unavailable",
      "operator": "Exists",
      "effect": "NoSchedule"
    }
  ]
}
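
For context (this note is mine, not from the original post): per the Kubernetes documentation, a toleration with an empty key and operator: Exists matches every taint, so it is the first entry in the list above, not the keyed ones, that makes these pods tolerate everything. A minimal TypeScript sketch of that check, using a hypothetical toleratesEveryTaint helper:

// Shape of a Kubernetes toleration as it appears in a pod spec.
interface Toleration {
  key?: string;
  operator?: 'Exists' | 'Equal';
  value?: string;
  effect?: string;
}

// A toleration with no key and operator "Exists" matches every taint,
// so a pod carrying it can never be fully evicted from a tainted node.
function toleratesEveryTaint(tolerations: Toleration[]): boolean {
  return tolerations.some((t) => !t.key && t.operator === 'Exists');
}

// The first toleration on aws-node/kube-proxy above is exactly that case:
console.log(toleratesEveryTaint([{ operator: 'Exists' }])); // true
console.log(
  toleratesEveryTaint([
    { key: 'node.kubernetes.io/not-ready', operator: 'Exists', effect: 'NoExecute' },
  ]),
); // false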

Do I need to change the tolerations of these pods to avoid tolerating all taints? If so, how, given that these pods are managed by AWS?

How can I avoid PodEvictionFailure?

As suggested by @Ola Ekdahl, and also in the Amazon AWS doc you shared, it's better to use the force flag rather than change the tolerations for the pods. See: https://docs.aws.amazon.com/eks/latest/userguide/managed-node-update-behavior.html ("Upgrade phase" #2)

You can add the force flag like the following and see if that helps:

import * as eks from 'aws-cdk-lib/aws-eks'; // CDK v2; for CDK v1 use '@aws-cdk/aws-eks'

new eks.Nodegroup(this, 'myNodeGroup', {
  cluster: this.cluster,
  forceUpdate: true, // proceed with the update even if pods cannot be evicted
  releaseVersion: '<AMI ID obtained from changelog>',
  ...
});
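
For what it's worth (my reading of the CDK and EKS docs, not part of the original answer): forceUpdate makes the node group update proceed even when pods can't be drained, for example because of a pod disruption budget, so the old nodes are terminated whether or not pods are still running on them. Without it, the upgrade phase fails with PodEvictionFailure once the eviction retries are exhausted.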
