
How do I resolve PodEvictionFailure error in AWS EKS?

I am trying to upgrade a node group in AWS EKS. I am using CDK and get the following error:

Resource handler returned message: "[ErrorDetail(ErrorCode=PodEvictionFailure, ErrorMessage=Reached max retries while trying to evict pods from nodes in node group <node-group-name>, ResourceIds=[<node-name>])] (Service: null, Status Code: 0, Request ID: null)" (RequestToken: <request-token>, HandlerErrorCode: GeneralServiceException)

According to the AWS documentation, PodEvictionFailure can occur when a deployment tolerates every taint, because the node then never becomes empty.

https://docs.aws.amazon.com/eks/latest/userguide/managed-node-update-behavior.html#managed-node-update-upgrade

Deployment tolerating all the taints – Once every pod is evicted, it's expected for the node to be empty because the node is tainted in the earlier steps. However, if the deployment tolerates every taint, then the node is more likely to be non-empty, leading to pod eviction failure.

I checked my nodes and every pod running on them, and found that the following pods tolerate all taints.

Both pods have the tolerations shown below.

  • Pod: kube-system/aws-node-pdmbh
  • Pod: kube-system/kube-proxy-7n2kf
{
  ...
  ...

  "tolerations": [
    {
      "operator": "Exists"
    },
    {
      "key": "node.kubernetes.io/not-ready",
      "operator": "Exists",
      "effect": "NoExecute"
    },
    {
      "key": "node.kubernetes.io/unreachable",
      "operator": "Exists",
      "effect": "NoExecute"
    },
    {
      "key": "node.kubernetes.io/disk-pressure",
      "operator": "Exists",
      "effect": "NoSchedule"
    },
    {
      "key": "node.kubernetes.io/memory-pressure",
      "operator": "Exists",
      "effect": "NoSchedule"
    },
    {
      "key": "node.kubernetes.io/pid-pressure",
      "operator": "Exists",
      "effect": "NoSchedule"
    },
    {
      "key": "node.kubernetes.io/unschedulable",
      "operator": "Exists",
      "effect": "NoSchedule"
    },
    {
      "key": "node.kubernetes.io/network-unavailable",
      "operator": "Exists",
      "effect": "NoSchedule"
    }
  ]
}
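In the list above, the entry that tolerates every taint is the first one: a toleration with `"operator": "Exists"` and neither a `key` nor an `effect` matches all taints. The remaining entries each match a single key and effect. A minimal sketch of that check follows; the `Toleration` interface and the helper names `isCatchAll` and `toleratesAllTaints` are hypothetical and written only for illustration.

```typescript
// Minimal shape of a Kubernetes toleration (fields as in the pod spec above).
interface Toleration {
  key?: string;
  operator?: string; // "Exists" or "Equal"
  value?: string;
  effect?: string;   // "NoSchedule", "PreferNoSchedule" or "NoExecute"
}

// A toleration with operator "Exists", no key and no effect matches every taint.
function isCatchAll(t: Toleration): boolean {
  return t.operator === "Exists" && !t.key && !t.effect;
}

// Hypothetical helper: true if any toleration in the list tolerates all taints.
function toleratesAllTaints(tolerations: Toleration[]): boolean {
  return tolerations.some(isCatchAll);
}

// Abbreviated copy of the tolerations shown above.
const podTolerations: Toleration[] = [
  { operator: "Exists" },
  { key: "node.kubernetes.io/not-ready", operator: "Exists", effect: "NoExecute" },
  { key: "node.kubernetes.io/unschedulable", operator: "Exists", effect: "NoSchedule" },
];

console.log(toleratesAllTaints(podTolerations)); // true
```

Dropping the first entry from the list makes the check return false, since every other toleration names a specific key.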

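Do I need to change the tolerations of these pods so that they no longer tolerate every taint? If so, how, given that these pods are managed by AWS?

How can I avoid PodEvictionFailure?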

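As @Ola Ekdahl suggested, and as the AWS documentation you shared also says, it is better to use the force flag than to change the pods' tolerations. See: https://docs.aws.amazon.com/eks/latest/userguide/managed-node-update-behavior.html ("Upgrade phase", item 2)

You can add the force flag as shown below and see whether it helps: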

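new eks.Nodegroup(this, 'myNodeGroup', {
  cluster: this.cluster,
  // Proceed with the update even if pods cannot be evicted from the old nodes.
  forceUpdate: true,
  // EKS-optimized AMI release version (not an AMI ID), taken from the AMI changelog.
  releaseVersion: '<AMI release version obtained from changelog>',
  ...
});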
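Outside CDK, the same behaviour is exposed by the EKS `UpdateNodegroupVersion` API through its `force` field. Keep in mind that a forced update proceeds even when pods cannot be drained, so workloads protected by a pod disruption budget may be interrupted. The sketch below only builds the request object. The local interface mirrors a subset of the real request fields, while the helper `buildForcedUpdate` and the cluster and node group names are hypothetical.

```typescript
// Local mirror of the request fields used here; the real input type ships
// with @aws-sdk/client-eks as UpdateNodegroupVersionCommandInput.
interface UpdateNodegroupVersionInput {
  clusterName: string;
  nodegroupName: string;
  releaseVersion?: string;
  force?: boolean;
}

// Hypothetical helper that builds a forced node group update request.
function buildForcedUpdate(
  clusterName: string,
  nodegroupName: string,
  releaseVersion?: string,
): UpdateNodegroupVersionInput {
  const input: UpdateNodegroupVersionInput = { clusterName, nodegroupName, force: true };
  // Only set releaseVersion when the caller pins a specific AMI release.
  if (releaseVersion !== undefined) {
    input.releaseVersion = releaseVersion;
  }
  return input;
}

// Placeholder names, not taken from the question.
const request = buildForcedUpdate("my-cluster", "my-node-group");
console.log(JSON.stringify(request));
```

With the AWS SDK for JavaScript v3 installed, an object of this shape can be passed to `new UpdateNodegroupVersionCommand(request)` and sent with an `EKSClient`.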
