
Azure Kubernetes Service (AKS) no longer able to create new nodepools

A few days ago our AKS cluster suffered backend "downtime", which a support engineer on the Azure team confirmed. The downtime seems to have primarily affected our cluster's load balancer. I first noticed the problem when I tried to create a new node pool on the cluster, which failed with this error message:

{
  "status": "Failed",
  "error": {
    "code": "ResourceOperationFailure",
    "message": "The resource operation completed with terminal provisioning state 'Failed'.",
    "details": [
      {
        "code": "ReconcileStandardLoadBalancerError",
        "message": "Reconcile standard load balancer failed. Details: outboundReconciler retry failed: Category: ClientError; SubCode: InvalidRequestFormat_DuplicateResourceName; Dependency: Microsoft.Network/LoadBalancers; OrginalError: Code=\"InvalidRequestFormat\" Message=\"Cannot parse the request.\" Details=[{\"code\":\"DuplicateResourceName\",\"message\":\"Resource /subscriptions//resourceGroups//providers/Microsoft.Network/loadBalancers/ has two child resources with the same name (REDACTED-PUBLIC-IP-RESOURCE-NAME).\"}]; AKSTeam: Networking."
      }
    ]
  }
}
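The useful part of that payload is buried inside the nested `message` string. As a minimal sketch (the function name `summarize_aks_error` is hypothetical, and the two regular expressions assume the message keeps the `SubCode: ...` and `with the same name (...)` wording shown above), the sub-code and the name being reported as duplicated can be pulled out like this:

```python
import json
import re


def summarize_aks_error(payload: str) -> list[dict]:
    """Pull the sub-code and the duplicated child name out of each error detail."""
    error = json.loads(payload)["error"]
    summaries = []
    for detail in error.get("details", []):
        message = detail.get("message", "")
        # Assumed wording, taken from the error above: "SubCode: <token>;"
        sub_code = re.search(r"SubCode: (\w+)", message)
        # Assumed wording: "... with the same name (<resource name>)."
        duplicate = re.search(r"with the same name \(([^)]+)\)", message)
        summaries.append({
            "code": detail.get("code"),
            "sub_code": sub_code.group(1) if sub_code else None,
            "duplicate_name": duplicate.group(1) if duplicate else None,
        })
    return summaries
```

If the wording of the message changes, the regular expressions simply return `None` for that field rather than raising.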

Since this error first occurred, we have been completely unable to create a new node pool on this cluster.

As far as I can tell, the resource it references, a public IP address, is not duplicated, but I don't really understand the error response at all.

I've been in touch with the AKS support team, but they seem to be at a loss as well and recommend simply updating the existing node image versions, which I am 99% sure won't fix this. I'm stuck and don't fully understand what the actual issue is. Any help would be hugely appreciated, even if it's just a similar experience with an error like this one.

Thanks.

My reading of that error is that AKS doesn't recognize that the public IP already exists, so it tries to create it again. The creation fails, so when you look, there's only one.
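One way to sanity-check that reading is to export the load balancer as JSON (for example with `az network lb show --output json`) and look for repeated names in its child collections. The following is only a sketch: `find_duplicate_children` is a hypothetical helper, it assumes each child collection is a list of objects carrying a `name` field, and it assumes names should be compared case-insensitively.

```python
from collections import Counter


def find_duplicate_children(lb: dict) -> dict[str, list[str]]:
    """Return {collection: [names seen more than once]} for a load balancer dict."""
    # Some exports nest child collections under "properties"; others flatten them.
    nested = lb.get("properties")
    body = nested if isinstance(nested, dict) else lb

    duplicates = {}
    for key, value in body.items():
        # Treat any list of objects that carry a string "name" as a child collection.
        if not isinstance(value, list):
            continue
        names = [
            item["name"].lower()  # assumption: names are case-insensitive
            for item in value
            if isinstance(item, dict) and isinstance(item.get("name"), str)
        ]
        repeated = sorted(n for n, count in Counter(names).items() if count > 1)
        if repeated:
            duplicates[key] = repeated
    return duplicates
```

An empty result does not rule out the problem: if the duplicate only appears in the request AKS builds during reconciliation, the stored load balancer will still show a single entry.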

I'd try the following, in order.

  1. Ensure the permissions on all of the resources are correct and include the AKS cluster account. If they look good, I'd even consider granting additional Reader access over everything. This is based on the assumption that if AKS could see the existing resource, it wouldn't try to create it.
  2. Remove the public IP configuration from the load balancer. AKS is failing to create it because of a duplicate name, so removing it should remove the conflict.
  3. Remove the public IP resource. Same rationale as (2).
  4. Remove the load balancer.

Caveat: These are in increasing order of risk to your existing cluster, and may result in a change in the public IP address, complete ingress failure, or worse. I would (OK, I wouldn't, but you should) discuss these with the support team before you attempt them.
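To make the four steps concrete enough to discuss with support, here is a rough sketch that only builds the corresponding Azure CLI command strings and executes nothing. The function name and every argument are placeholders. Mapping step 2 to `az network lb frontend-ip delete` is an assumption about what "public IP configuration" means here, and the resource group is assumed to be the cluster's node resource group.

```python
import shlex


def escalation_plan(resource_group: str, lb_name: str, frontend_name: str,
                    public_ip_name: str, principal_id: str) -> list[str]:
    """Return the az commands for steps 1-4 as strings; nothing is executed."""
    steps = [
        # 1. Read-only: list role assignments held by the cluster identity.
        ["az", "role", "assignment", "list", "--assignee", principal_id, "--all"],
        # 2. Destructive: remove the frontend IP configuration from the load balancer.
        ["az", "network", "lb", "frontend-ip", "delete",
         "--resource-group", resource_group, "--lb-name", lb_name,
         "--name", frontend_name],
        # 3. Destructive: delete the public IP resource.
        ["az", "network", "public-ip", "delete",
         "--resource-group", resource_group, "--name", public_ip_name],
        # 4. Destructive: delete the load balancer itself.
        ["az", "network", "lb", "delete",
         "--resource-group", resource_group, "--name", lb_name],
    ]
    # shlex.join quotes arguments so each string is safe to paste into a shell.
    return [shlex.join(step) for step in steps]
```

Printing the list gives something to review line by line before anything is run; only the first command is read-only.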

-Dave
