How do I resolve the PodEvictionFailure error in AWS EKS?

Question

I am trying to upgrade a node group in AWS EKS. I am using CDK, and I get the following error:

Resource handler returned message: "[ErrorDetail(ErrorCode=PodEvictionFailure, ErrorMessage=Reached max retries while trying to evict pods from nodes in node group <node-group-name>, ResourceIds=[<node-name>])] (Service: null, Status Code: 0, Request ID: null)" (RequestToken: <request-token>, HandlerErrorCode: GeneralServiceException)

According to the AWS docs, PodEvictionFailure can occur when a deployment tolerates every taint, because the node then never becomes empty.

https://docs.aws.amazon.com/eks/latest/userguide/managed-node-update-behavior.html#managed-node-update-upgrade

Deployment tolerating all the taints – Once every pod is evicted, it's expected for the node to be empty because the node is tainted in the earlier steps. However, if the deployment tolerates every taint, then the node is more likely to be non-empty, leading to pod eviction failure.

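I checked my nodes and all the pods running on them, and found that the following pods tolerate every taint.

These two pods have the tolerations shown below.

Pod: kube-system/aws-node-pdmbh
Pod: kube-system/kube-proxy-7n2kf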

{
  ...
  ...

  "tolerations": [
    {
      "operator": "Exists"
    },
    {
      "key": "node.kubernetes.io/not-ready",
      "operator": "Exists",
      "effect": "NoExecute"
    },
    {
      "key": "node.kubernetes.io/unreachable",
      "operator": "Exists",
      "effect": "NoExecute"
    },
    {
      "key": "node.kubernetes.io/disk-pressure",
      "operator": "Exists",
      "effect": "NoSchedule"
    },
    {
      "key": "node.kubernetes.io/memory-pressure",
      "operator": "Exists",
      "effect": "NoSchedule"
    },
    {
      "key": "node.kubernetes.io/pid-pressure",
      "operator": "Exists",
      "effect": "NoSchedule"
    },
    {
      "key": "node.kubernetes.io/unschedulable",
      "operator": "Exists",
      "effect": "NoSchedule"
    },
    {
      "key": "node.kubernetes.io/network-unavailable",
      "operator": "Exists",
      "effect": "NoSchedule"
    }
  ]
}
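
For reference, the first entry in that list is the one that matters: in Kubernetes, a toleration with operator `Exists` and no `key` or `effect` matches every taint. A minimal sketch of that check follows. The `Toleration` interface and the `toleratesAllTaints` helper are hypothetical names used only for illustration, not part of any AWS or Kubernetes library.

```typescript
// Minimal shape of a toleration, using the field names from the pod spec above.
interface Toleration {
  key?: string;
  operator?: string;
  effect?: string;
  value?: string;
}

// Hypothetical helper: returns true if any toleration is a wildcard,
// i.e. operator "Exists" with neither a key nor an effect set.
function toleratesAllTaints(tolerations: Toleration[]): boolean {
  return tolerations.some(
    (t) => t.operator === "Exists" && !t.key && !t.effect
  );
}

// First two tolerations copied from the pod spec above.
const awsNodeTolerations: Toleration[] = [
  { operator: "Exists" },
  { key: "node.kubernetes.io/not-ready", operator: "Exists", effect: "NoExecute" },
];

console.log(toleratesAllTaints(awsNodeTolerations)); // true
```

The keyed entries alone would not make the pod tolerate every taint; only the bare `Exists` entry does.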

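Do I need to change the tolerations of these pods so that they no longer tolerate every taint? If so, how, given that these pods are managed by AWS?

How can I avoid PodEvictionFailure?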

Answer 1

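As @Ola Ekdahl suggested, and as the AWS documentation you shared also says, it is better to use the force flag than to change the pods' tolerations. See: https://docs.aws.amazon.com/eks/latest/userguide/managed-node-update-behavior.html ("Upgrade phase" #2)

You can add the force flag as shown below and see whether it helps: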

new eks.Nodegroup(this, 'myNodeGroup', {
  cluster: this.cluster,
  // Force the update even if pods on the existing nodes cannot be drained.
  forceUpdate: true,
  releaseVersion: '<AMI ID obtained from changelog>',
  ...
});

Solution 1
0 2022-12-21 04:48:11