
GKE Kubernetes Autoscaler - max cluster cpu, memory limit reached

The GKE Autoscaler is not scaling nodes up beyond 15 nodes (the former limit).

I've changed the Min and Max values in the cluster to 17-25.

However, the node count is stuck at 14-15 and is not going up. Right now my cluster is full and no more pods can fit, so every new deployment should trigger a node scale-up and schedule itself onto the new node, but that is not happening.

When I create a deployment, it gets stuck in the Pending state with the message:

pod didn't trigger scale-up (it wouldn't fit if a new node is added): 1 max cluster cpu, memory limit reached

"Max cluster cpu, memory limit reached" sounds like the maximum node count is somehow still 14-15. How is that possible? Why isn't it triggering a node scale-up?

ClusterAutoscaler status:

apiVersion: v1
data:
  status: |+
    Cluster-autoscaler status at 2020-03-10 10:35:39.899329642 +0000 UTC:
    Cluster-wide:
      Health:      Healthy (ready=14 unready=0 notStarted=0 longNotStarted=0 registered=14 longUnregistered=0)
                   LastProbeTime:      2020-03-10 10:35:39.608193389 +0000 UTC m=+6920.650397445
                   LastTransitionTime: 2020-03-10 09:49:11.965623459 +0000 UTC m=+4133.007827509
      ScaleUp:     NoActivity (ready=14 registered=14)
                   LastProbeTime:      2020-03-10 10:35:39.608193389 +0000 UTC m=+6920.650397445
                   LastTransitionTime: 2020-03-10 08:40:47.775200087 +0000 UTC m=+28.817404126
      ScaleDown:   NoCandidates (candidates=0)
                   LastProbeTime:      2020-03-10 10:35:39.608193389 +0000 UTC m=+6920.650397445
                   LastTransitionTime: 2020-03-10 09:49:49.580623718 +0000 UTC m=+4170.622827779

    NodeGroups:
      Name:        https://content.googleapis.com/compute/v1/projects/project/zones/europe-west4-b/instanceGroups/adjust-scope-bff43e09-grp
      Health:      Healthy (ready=14 unready=0 notStarted=0 longNotStarted=0 registered=14 longUnregistered=0 cloudProviderTarget=14 (minSize=17, maxSize=25))
                   LastProbeTime:      2020-03-10 10:35:39.608193389 +0000 UTC m=+6920.650397445
                   LastTransitionTime: 2020-03-10 09:46:19.45614781 +0000 UTC m=+3960.498351857
      ScaleUp:     NoActivity (ready=14 cloudProviderTarget=14)
                   LastProbeTime:      2020-03-10 10:35:39.608193389 +0000 UTC m=+6920.650397445
                   LastTransitionTime: 2020-03-10 09:46:19.45614781 +0000 UTC m=+3960.498351857
      ScaleDown:   NoCandidates (candidates=0)
                   LastProbeTime:      2020-03-10 10:35:39.608193389 +0000 UTC m=+6920.650397445
                   LastTransitionTime: 2020-03-10 09:49:49.580623718 +0000 UTC m=+4170.622827779
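For reference, this status can be dumped directly from the kube-system namespace; the ConfigMap name below is the Cluster Autoscaler's default, which is what GKE uses:

```shell
# Print the Cluster Autoscaler's human-readable status.
# Watch the NodeGroups section for cloudProviderTarget vs. minSize/maxSize.
kubectl -n kube-system get configmap cluster-autoscaler-status \
  -o jsonpath='{.data.status}'
```

Note the oddity in the output above: the node group reports cloudProviderTarget=14 even though minSize=17, so the autoscaler's view of the group disagrees with the configured bounds.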

The deployment is very small (200m CPU, 256Mi memory), so it would surely fit if a new node were added.

This looks like a bug in the node pool/autoscaler: 15 was my former node count limit, and somehow it still seems to think 15 is the maximum.

EDIT: I created a new node pool with bigger machines and turned autoscaling on in GKE, but after some time the same issue appeared again, even though the nodes have free resources. Top from the nodes:

NAME                                                  CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%   
gke-infrastructure-n-autoscaled-node--0816b9c6-fm5v   805m         41%    4966Mi          88%       
gke-infrastructure-n-autoscaled-node--0816b9c6-h98f   407m         21%    2746Mi          48%       
gke-infrastructure-n-autoscaled-node--0816b9c6-hr0l   721m         37%    3832Mi          67%       
gke-infrastructure-n-autoscaled-node--0816b9c6-prfw   1020m        52%    5102Mi          90%       
gke-infrastructure-n-autoscaled-node--0816b9c6-s94x   946m         49%    3637Mi          64%       
gke-infrastructure-n-autoscaled-node--0816b9c6-sz5l   2000m        103%   5738Mi          101%      
gke-infrastructure-n-autoscaled-node--0816b9c6-z6dv   664m         34%    4271Mi          75%       
gke-infrastructure-n-autoscaled-node--0816b9c6-zvbr   970m         50%    3061Mi          54%

And yet I still get the message 1 max cluster cpu, memory limit reached. This keeps happening when updating a deployment: the new version sometimes gets stuck in Pending because it won't trigger the scale-up.

EDIT2: While describing the cluster with the gcloud command, I found this:

autoscaling:
  autoprovisioningNodePoolDefaults:
    oauthScopes:
    - https://www.googleapis.com/auth/logging.write
    - https://www.googleapis.com/auth/monitoring
    serviceAccount: default
  enableNodeAutoprovisioning: true
  resourceLimits:
  - maximum: '5'
    minimum: '1'
    resourceType: cpu
  - maximum: '5'
    minimum: '1'
    resourceType: memory

How does this interact with autoscaling being turned on? Does it refuse to trigger a scale-up once those limits are reached? (The cluster's current totals are already above them.)
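If those cluster-wide autoprovisioning limits are the culprit, they can be raised with gcloud. The cluster name and limit values below are placeholders to adapt:

```shell
# Raise the cluster-wide CPU/memory limits used by node auto-provisioning.
# CLUSTER_NAME and the numeric limits are placeholders, not values from this cluster.
gcloud container clusters update CLUSTER_NAME \
  --enable-autoprovisioning \
  --min-cpu 1 --max-cpu 100 \
  --min-memory 1 --max-memory 1000
```

The limits are totals across the whole cluster, so they must be at least as large as the sum of what the existing node pools can grow to.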

I ran into the same issue and was bashing my head against the wall trying to figure out what was going on. Even support couldn't figure it out.

The issue is that if you enable node auto-provisioning at the cluster level, you are setting the actual min/max CPU and memory allowed for the entire cluster. At first glance the UI seems to be asking for the min/max CPU and memory you would want per auto-provisioned node, but that is not correct. So if, for example, you wanted a maximum of 100 nodes with 8 CPUs per node, your max CPU should be 800. I know a cluster-wide maximum is obviously useful so things don't get out of control, but the way it is presented is not intuitive. Since you actually don't have control over which machine type gets picked, wouldn't it be useful to stop Kubernetes from picking a 100-core machine for a 1-core task? That is what I thought it was asking when I was configuring it.
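To make that arithmetic concrete, here is a minimal sketch (my own illustration, not GKE code) of how the cluster-level limits relate to a per-node machine shape:

```python
def cluster_limits(max_nodes: int, cpus_per_node: int, mem_gb_per_node: int) -> dict:
    """The autoprovisioning resourceLimits are cluster-wide totals, not per-node values."""
    return {
        "max_cpu": max_nodes * cpus_per_node,
        "max_memory_gb": max_nodes * mem_gb_per_node,
    }

# e.g. up to 100 nodes of 8 vCPU / 32 GB each
limits = cluster_limits(max_nodes=100, cpus_per_node=8, mem_gb_per_node=32)
print(limits)  # {'max_cpu': 800, 'max_memory_gb': 3200}
```

Compare that with the `maximum: '5'` values in the question's autoscaling config: a 5-CPU / 5-GB ceiling for the whole cluster is exceeded almost immediately, which matches the "max cluster cpu, memory limit reached" message.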

Node auto-provisioning is useful because if for some reason auto-scaling on your own node pool can't meet your demands (due to quota issues, for instance), the cluster-level node auto-provisioner will figure out a different node pool machine type that it can provision to meet them. In my scenario I was using C2 CPUs, and there was a scarcity of those CPUs in the region, so my node pool stopped auto-scaling.

To make things even more confusing, most people start by specifying their node pool machine type, so they are already used to customizing these limits on a per-node basis. But then something stops working, like a quota issue you have no idea about, so you get desperate and configure the node auto-provisioner at the cluster level, and then get totally screwed because you thought you were specifying the limits for the new potential machine type.

Hopefully this helps clear some things up.

Can you please check whether you have reached your project quotas? For example, try to manually create a VM. If it's not related to quota, can you specify the GKE version you use?
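Both checks can be done from the command line. The cluster name is a placeholder; the region and zone are taken from the instance group URL in the question:

```shell
# Show regional resource quotas (CPUs, instances, disks, ...) and current usage.
gcloud compute regions describe europe-west4 --format="yaml(quotas)"

# Show the cluster's GKE control-plane version. CLUSTER_NAME is a placeholder.
gcloud container clusters describe CLUSTER_NAME \
  --zone europe-west4-b --format="value(currentMasterVersion)"
```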
