Azure Kubernetes Service (AKS) no longer able to create new node pools

A few days ago our AKS cluster suffered backend "downtime", which a support engineer on the Azure team confirmed. The downtime seems to have primarily affected our cluster's load balancer. I first noticed the problem when I tried to create a new node pool on the cluster; the operation failed with this error message:

{
  "status": "Failed",
  "error": {
    "code": "ResourceOperationFailure",
    "message": "The resource operation completed with terminal provisioning state 'Failed'.",
    "details": [
      {
        "code": "ReconcileStandardLoadBalancerError",
        "message": "Reconcile standard load balancer failed. Details: outboundReconciler retry failed: Category: ClientError; SubCode: InvalidRequestFormat_DuplicateResourceName; Dependency: Microsoft.Network/LoadBalancers; OrginalError: Code=\"InvalidRequestFormat\" Message=\"Cannot parse the request.\" Details=[{\"code\":\"DuplicateResourceName\",\"message\":\"Resource /subscriptions//resourceGroups//providers/Microsoft.Network/loadBalancers/ has two child resources with the same name (REDACTED-PUBLIC-IP-RESOURCE-NAME).\"}]; AKSTeam: Networking."
      }
    ]
  }
}
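
The useful part of that response is buried in the nested message string. As a reading aid, here is a minimal Python sketch that pulls the fields out. The function name `summarize_error` and the regular expressions are assumptions based only on the text of the error above, not on any documented Azure format, and the payload is abbreviated to the `details` entry.

```python
import json
import re

# Abbreviated copy of the error above. The empty path segments and the
# redacted name are kept exactly as they appear in the original post.
ERROR_JSON = r'''
{
  "status": "Failed",
  "error": {
    "code": "ResourceOperationFailure",
    "details": [
      {
        "code": "ReconcileStandardLoadBalancerError",
        "message": "Reconcile standard load balancer failed. Details: outboundReconciler retry failed: Category: ClientError; SubCode: InvalidRequestFormat_DuplicateResourceName; Dependency: Microsoft.Network/LoadBalancers; OrginalError: Code=\"InvalidRequestFormat\" Message=\"Cannot parse the request.\" Details=[{\"code\":\"DuplicateResourceName\",\"message\":\"Resource /subscriptions//resourceGroups//providers/Microsoft.Network/loadBalancers/ has two child resources with the same name (REDACTED-PUBLIC-IP-RESOURCE-NAME).\"}]; AKSTeam: Networking."
      }
    ]
  }
}
'''


def summarize_error(error_doc):
    """Return one dict per detail entry with the fields worth reading."""
    rows = []
    for detail in error_doc.get("error", {}).get("details", []):
        message = detail.get("message", "")
        # "Key: value;" pairs embedded in the message text (assumed layout).
        fields = dict(re.findall(r"(Category|SubCode|Dependency): ([^;]+);", message))
        # The name the resource provider reports as duplicated, if present.
        dup = re.search(r"two child resources with the same name \(([^)]+)\)", message)
        rows.append({
            "code": detail.get("code"),
            **fields,
            "duplicate_name": dup.group(1) if dup else None,
        })
    return rows


if __name__ == "__main__":
    for row in summarize_error(json.loads(ERROR_JSON)):
        for key, value in row.items():
            print(f"{key}: {value}")
```

Read this way, the complaint comes from `Microsoft.Network/LoadBalancers` and concerns two child resources of the load balancer sharing one name, which is not necessarily the same thing as two public IP resources existing.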

Since this error first occurred, we have been unable to create a new node pool on this cluster.

As far as I can tell, the resource it references, a public IP address, is not duplicated, but I don't really understand the error response.

I've been in touch with the AKS support team, but they seem to be at a loss as well and are recommending that we just update the existing node image versions, which I am 99% sure won't fix this. I'm stuck and don't fully understand what the actual issue is. Any help would be hugely appreciated, even if it's just a similar experience with an error like this one.

Thanks.

My reading of that error is that AKS doesn't recognize that the public IP already exists, so it tries to create it again. That attempt fails, so when you look, there's only one.
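
One way to test that reading is to dump the load balancer definition and look for repeated names among its child collections. The following is a minimal sketch, not a diagnostic tool: `find_duplicate_children` is a hypothetical helper, the sample document and its names are made up, and the case-insensitive comparison rests on my assumption that Azure treats resource names that way. A duplicate that only exists in the request AKS is trying to send would not show up here.

```python
import json
from collections import Counter


def find_duplicate_children(lb):
    """Return {collection_name: [repeated names]} for a load balancer document.

    Scans every list-valued key at the top level and, if present, under
    "properties", so it does not depend on specific collection names.
    """
    props = lb.get("properties")
    scopes = [lb] + ([props] if isinstance(props, dict) else [])
    duplicates = {}
    for scope in scopes:
        for key, value in scope.items():
            if not isinstance(value, list):
                continue
            # Compare names case-insensitively (assumption, see lead-in).
            names = [
                item["name"].lower()
                for item in value
                if isinstance(item, dict) and isinstance(item.get("name"), str)
            ]
            repeated = sorted(n for n, count in Counter(names).items() if count > 1)
            if repeated:
                duplicates[key] = repeated
    return duplicates


# Hypothetical document shaped like a load balancer export; all names are made up.
sample_lb = {
    "name": "example-lb",
    "frontendIPConfigurations": [
        {"name": "example-ip"},
        {"name": "Example-IP"},
        {"name": "other-ip"},
    ],
    "outboundRules": [{"name": "example-outbound-rule"}],
}

if __name__ == "__main__":
    print(json.dumps(find_duplicate_children(sample_lb), indent=2))
```

To run it against the real resource, you could save the output of `az network lb show --resource-group <node-resource-group> --name <lb-name>` to a file and pass the result of `json.load` to the function; the placeholders are yours to fill in.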

I'd try the following, in order.

  1. Ensure the permissions on all of the resources are correct and include the AKS cluster account. If they look good, I'd even consider granting it additional Reader access over everything. This is based on the assumption that if AKS could see the existing resource, it wouldn't try to create it.
  2. Remove the public IP configuration from the load balancer. AKS is failing to create it because of a duplicate name, so removing it should remove the conflict.
  3. Remove the public IP resource. Same rationale as (2).
  4. Remove the load balancer.

Caveat: These are in increasing order of risk to your existing cluster and may result in a change of public IP address, complete ingress failure, or worse. I would (OK, I wouldn't, but you should) discuss these with the support team before you attempt them.
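
For discussing the steps with support, it can help to write them down as concrete commands first. The sketch below only builds and prints Azure CLI command lines and never runs them. The mapping from each step to an `az` subcommand is my assumption about how you would carry it out, and every argument value (resource group, load balancer, frontend configuration, public IP, principal ID) is a hypothetical placeholder.

```python
import shlex


def plan_commands(resource_group, lb_name, frontend_name, public_ip_name, principal_id):
    """Return (label, argv) pairs for the four steps, in increasing order of risk."""
    return [
        ("1. review role assignments for the cluster identity",
         ["az", "role", "assignment", "list",
          "--assignee", principal_id, "--all", "--output", "table"]),
        ("2. remove the public IP configuration from the load balancer",
         ["az", "network", "lb", "frontend-ip", "delete",
          "--resource-group", resource_group, "--lb-name", lb_name,
          "--name", frontend_name]),
        ("3. remove the public IP resource",
         ["az", "network", "public-ip", "delete",
          "--resource-group", resource_group, "--name", public_ip_name]),
        ("4. remove the load balancer",
         ["az", "network", "lb", "delete",
          "--resource-group", resource_group, "--name", lb_name]),
    ]


def render(plan):
    """Format each step as a comment line followed by a shell-quoted command."""
    return [f"# {label}\n{shlex.join(argv)}" for label, argv in plan]


if __name__ == "__main__":
    # All values below are placeholders, not names from the affected cluster.
    plan = plan_commands(
        resource_group="example-node-rg",
        lb_name="example-lb",
        frontend_name="example-frontend",
        public_ip_name="example-public-ip",
        principal_id="00000000-0000-0000-0000-000000000000",
    )
    print("\n\n".join(render(plan)))
```

Keeping this as a dry run is deliberate: steps 2 to 4 are destructive, so the printed list is something to review with the support team rather than a script to execute.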

-Dave
