
Kubernetes Cluster autoscaler not scaling down instances on EKS - just logs that the node is unneeded

Here are the logs from the autoscaler:

I0922 17:08:33.857348       1 auto_scaling_groups.go:102] Updating ASG terraform-eks-demo20190922161659090500000007--terraform-eks-demo20190922161700651000000008
I0922 17:08:33.857380       1 aws_manager.go:152] Refreshed ASG list, next refresh after 2019-09-22 17:08:43.857375311 +0000 UTC m=+259.289807511
I0922 17:08:33.857465       1 utils.go:526] No pod using affinity / antiaffinity found in cluster, disabling affinity predicate for this loop
I0922 17:08:33.857482       1 static_autoscaler.go:261] Filtering out schedulables
I0922 17:08:33.857532       1 static_autoscaler.go:271] No schedulable pods
I0922 17:08:33.857545       1 static_autoscaler.go:279] No unschedulable pods
I0922 17:08:33.857557       1 static_autoscaler.go:333] Calculating unneeded nodes
I0922 17:08:33.857601       1 scale_down.go:376] Scale-down calculation: ignoring 2 nodes unremovable in the last 5m0s
I0922 17:08:33.857621       1 scale_down.go:407] Node ip-10-0-1-135.us-west-2.compute.internal - utilization 0.055000
I0922 17:08:33.857688       1 static_autoscaler.go:349] ip-10-0-1-135.us-west-2.compute.internal is unneeded since 2019-09-22 17:05:07.299351571 +0000 UTC m=+42.731783882 duration 3m26.405144434s
I0922 17:08:33.857703       1 static_autoscaler.go:360] Scale down status: unneededOnly=true lastScaleUpTime=2019-09-22 17:04:42.29864432 +0000 UTC m=+17.731076395 lastScaleDownDeleteTime=2019-09-22 17:04:42.298645611 +0000 UTC m=+17.731077680 lastScaleDownFailTime=2019-09-22 17:04:42.298646962 +0000 UTC m=+17.731079033 scaleDownForbidden=false isDeleteInProgress=false

If it's unneeded, then what is the next step? What is it waiting for?

I've drained one node:

kubectl get nodes -o=wide
NAME                                       STATUS                     ROLES    AGE   VERSION               INTERNAL-IP   EXTERNAL-IP      OS-IMAGE         KERNEL-VERSION                  CONTAINER-RUNTIME
ip-10-0-0-118.us-west-2.compute.internal   Ready                      <none>   46m   v1.13.10-eks-d6460e   10.0.0.118    52.40.115.132    Amazon Linux 2   4.14.138-114.102.amzn2.x86_64   docker://18.6.1
ip-10-0-0-211.us-west-2.compute.internal   Ready                      <none>   44m   v1.13.10-eks-d6460e   10.0.0.211    35.166.57.203    Amazon Linux 2   4.14.138-114.102.amzn2.x86_64   docker://18.6.1
ip-10-0-1-135.us-west-2.compute.internal   Ready,SchedulingDisabled   <none>   46m   v1.13.10-eks-d6460e   10.0.1.135    18.237.253.134   Amazon Linux 2   4.14.138-114.102.amzn2.x86_64   docker://18.6.1
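
For reference, a drain on a kubectl 1.13 cluster typically looks like the following (the exact flags are an assumption, not necessarily what was run here):

# Cordon the node and evict its pods (DaemonSet pods are skipped, emptyDir data is discarded)
kubectl drain ip-10-0-1-135.us-west-2.compute.internal --ignore-daemonsets --delete-local-data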

Why is it not terminating the instance?

These are the parameters I'm using:

        - ./cluster-autoscaler
        - --cloud-provider=aws
        - --namespace=default
        - --scan-interval=25s
        - --scale-down-unneeded-time=30s
        - --nodes=1:20:terraform-eks-demo20190922161659090500000007--terraform-eks-demo20190922161700651000000008
        - --node-group-auto-discovery=asg:tag=k8s.io/cluster-autoscaler/enabled,k8s.io/cluster-autoscaler/example-job-runner
        - --logtostderr=true
        - --stderrthreshold=info
        - --v=4

Have you got any of the following? (Each of these can be checked with kubectl; see the example commands after this list.)

  • Pods running on that node without a controller object (i.e. a Deployment / ReplicaSet)?
  • Any kube-system pods that don't have a PodDisruptionBudget
  • Pods with local storage or any custom affinity/anti-affinity/nodeSelectors
  • An annotation set on that node that prevents cluster-autoscaler from scaling it down
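
A minimal set of kubectl checks for the points above, using the node name from the question (adjust names and namespaces to your setup):

# Pods still scheduled on the node - look for pods not backed by a controller
kubectl get pods --all-namespaces -o wide --field-selector spec.nodeName=ip-10-0-1-135.us-west-2.compute.internal

# PodDisruptionBudgets covering kube-system pods
kubectl get pdb -n kube-system

# Node annotations - cluster-autoscaler.kubernetes.io/scale-down-disabled=true blocks scale-down
kubectl describe node ip-10-0-1-135.us-west-2.compute.internal | grep -i scale-down-disabled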

Your config/start-up options for CA look good to me though.

I can only imagine it might be something to do with a specific pod running on that node. Maybe run through the pods (especially the kube-system pods) on the node that isn't scaling down and check them against the list above.

These two sections of the Cluster Autoscaler FAQ have some good items to check that might be causing CA to not scale down nodes:

  • "I have a couple of nodes with low utilization, but they are not scaled down. Why?"
  • "What types of pods can prevent CA from removing a node?"
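
One more thing that may explain the "what is it waiting for" part: the log line Scale down status: unneededOnly=true means CA was only marking nodes as unneeded in that loop, not removing them; it stays in that mode for a cooldown window after the last scale-up (and, I believe, after its own restart, since those timestamps are initialised at startup). If that cooldown is what you're hitting, the delay flags can be shortened. A sketch of assumed additions to the container args above (the values below are just examples; the add delay defaults to about 10 minutes):

        # Assumed additions - shorten the cooldowns during which CA only marks nodes unneeded
        - --scale-down-delay-after-add=2m
        - --scale-down-delay-after-delete=30s
        - --scale-down-delay-after-failure=2m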

Here's what I did to solve this issue:

  1. Tail logs for cluster-autoscaler (I used kubetail since cluster-autoscaler had multiple replicas)
  2. From the AWS console, I found the autoscaling group related to my cluster
  3. Reduced the number of desired nodes of the autoscaling group from the AWS console (the AWS CLI works too; see the sketch after this list)
  4. Waited until the cluster-autoscaler scaled the cluster down
  5. Waited again until the cluster-autoscaler scaled the cluster up
  6. Found the reason for scaling up in the logs and handled it accordingly
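
For reference, steps 1 and 3 can also be done from the command line. A sketch, reusing the ASG name from the question (the pod name prefix, namespace, and desired capacity are assumptions - adjust to your setup):

# Tail all cluster-autoscaler replicas at once (kubetail matches pods by name prefix)
kubetail cluster-autoscaler -n kube-system

# CLI equivalent of step 3: lower the ASG's desired capacity (2 here is just an example)
aws autoscaling set-desired-capacity \
  --auto-scaling-group-name terraform-eks-demo20190922161659090500000007--terraform-eks-demo20190922161700651000000008 \
  --desired-capacity 2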
