
Cluster-autoscaler not scaling up from 0 on Azure with ACS-Engine

I'm trying to set up a cluster in Azure with acs-engine, building a Kubernetes cluster that uses VMSS for the agent pools. Once the cluster is up, I add the cluster-autoscaler to manage two dedicated agent pools, one CPU and one GPU. Scaling down and scaling up both work as long as the scale set still has a running VM, and both scale sets are configured to scale down to 0. Through ACS I have set both up with taints and custom labels. However, once a scale set has scaled down to 0, I can't get the autoscaler to spin a node back up when a new pod is scheduled. I'm not sure what I'm doing wrong, or whether I'm missing some configuration, label, taint, etc. I only recently started working with Kubernetes.

Below are my acs-engine JSON, the pod definition, the autoscaler's logs, and the pod's describe output.

Output from kubectl logs -n kube-system cluster-autoscaler-5967b96496-jnvjr:

I0920 16:11:14.925761       1 scale_up.go:249] Pod default/my-test-pod is unschedulable
I0920 16:11:14.999323       1 utils.go:196] Pod my-test-pod can't be scheduled on k8s-pool2-24760778-vmss, predicate failed: GeneralPredicates predicate mismatch, cannot put default/my-test-pod on template-node-for-k8s-pool2-24760778-vmss-6220731686255962863, reason: node(s) didn't match node selector
I0920 16:11:14.999408       1 utils.go:196] Pod my-test-pod can't be scheduled on k8s-pool3-24760778-vmss, predicate failed: GeneralPredicates predicate mismatch, cannot put default/my-test-pod on template-node-for-k8s-pool3-24760778-vmss-3043543739698957784, reason: node(s) didn't match node selector
I0920 16:11:14.999442       1 scale_up.go:376] No expansion options

Output from kubectl describe pod my-test-pod:

Name:               my-test-pod
Namespace:          default
Priority:           0
PriorityClassName:  <none>
Node:               <none>
Labels:             <none>
Annotations:        kubectl.kubernetes.io/last-applied-configuration={"apiVersion":"v1","kind":"Pod","metadata":{"annotations":{},"name":"my-test-pod","namespace":"default"},"spec":{"affinity":{"nodeAffinity":{"preferred...
Status:             Pending
IP:
Containers:
  my-test-pod:
    Image:      ubuntu:latest
    Port:       <none>
    Host Port:  <none>
    Command:
      /bin/bash
      -ec
      while :; do echo '.'; sleep 5; done
    Environment:  <none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-qzm6s (ro)
Conditions:
  Type           Status
  PodScheduled   False
Volumes:
  default-token-qzm6s:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-qzm6s
    Optional:    false
QoS Class:       BestEffort
Node-Selectors:  agentpool=pool2
                 environment=DEV
                 hardware=cpu-spec
                 node-template=k8s-pool2-24760778-vmss
                 vmSize=Standard_D4s_v3
Tolerations:     dedicated=pool2:NoSchedule
                 node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s
Events:
  Type     Reason             Age                 From                Message
  ----     ------             ----                ----                -------
  Warning  FailedScheduling   2m (x273 over 17m)  default-scheduler   0/3 nodes are available: 3 node(s) didn't match node selector.
  Normal   NotTriggerScaleUp  2m (x89 over 17m)   cluster-autoscaler  pod didn't trigger scale-up (it wouldn't fit if a new node is added)

ACS-Engine config file (rendered and generated with Terraform):

{
    "apiVersion": "vlabs",
    "properties": {
      "orchestratorProfile": {
        "orchestratorType": "Kubernetes",
        "orchestratorRelease": "1.11",
        "kubernetesConfig": {
          "networkPlugin": "azure",
          "clusterSubnet": "${cidr}",
          "privateCluster": {
            "enabled": true
          },
          "addons": [
            {
              "name": "nvidia-device-plugin",
              "enabled": true
            },
            {
              "name": "cluster-autoscaler",
              "enabled": true,
              "config": {
                "minNodes": "0",
                "maxNodes": "2",
                "image": "gcr.io/google-containers/cluster-autoscaler:1.3.1"
              }
            }
          ]
        }
      },
      "masterProfile": {
        "count": ${master_vm_count},
        "dnsPrefix": "${dns_prefix}",
        "vmSize": "${master_vm_size}",
        "storageProfile": "ManagedDisks",
        "vnetSubnetId": "${pool_subnet_id}",
        "firstConsecutiveStaticIP": "${first_master_ip}",
        "vnetCidr": "${cidr}"
      },
      "agentPoolProfiles": [
        {
          "name": "pool3",
          "count": ${dedicated_vm_count},
          "vmSize": "${dedicated_vm_size}",
          "storageProfile": "ManagedDisks",
          "OSDiskSizeGB": 31,
          "vnetSubnetId": "${pool_subnet_id}",
          "customNodeLabels": {
              "vmSize":"${dedicated_vm_size}",
              "dedicatedOnly": "true",
              "environment":"${environment}",
              "hardware": "${dedicated_spec}"
          },
          "availabilityProfile": "VirtualMachineScaleSets",
          "scaleSetEvictionPolicy": "Delete",
          "kubernetesConfig": {
            "kubeletConfig": {
              "--register-with-taints": "dedicated=pool3:NoSchedule"
            }
          }
        },
        {
          "name": "pool2",
          "count": ${pool2_vm_count},
          "vmSize": "${pool2_vm_size}",
          "storageProfile": "ManagedDisks",
          "OSDiskSizeGB": 31,
          "vnetSubnetId": "${pool_subnet_id}",
          "availabilityProfile": "VirtualMachineScaleSets",
          "customNodeLabels": {
              "vmSize":"${pool2_vm_size}",
              "environment":"${environment}",
              "hardware": "${pool_spec}"
          },
          "kubernetesConfig": {
            "kubeletConfig": {
              "--register-with-taints": "dedicated=pool2:NoSchedule"
            }
          }
        },
        {
          "name": "pool1",
          "count": ${pool1_vm_count},
          "vmSize": "${pool1_vm_size}",
          "storageProfile": "ManagedDisks",
          "OSDiskSizeGB": 31,
          "vnetSubnetId": "${pool_subnet_id}",
          "availabilityProfile": "VirtualMachineScaleSets",
          "customNodeLabels": {
              "vmSize":"${pool1_vm_size}",
              "environment":"${environment}",
              "hardware": "${pool_spec}"
          }
        }
      ],
      "linuxProfile": {
        "adminUsername": "${admin_user}",
        "ssh": {
          "publicKeys": [
            {
              "keyData": "${ssh_key}"
            }
          ]
        }
      },
      "servicePrincipalProfile": {
        "clientId": "${service_principal_client_id}",
        "secret": "${service_principal_client_secret}"
      }
    }
  }
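
As a sanity check (this is my own verification step, not from any docs), the labels and taints each pool actually registers with can be confirmed from a running node before it scales to 0; the instance name below is illustrative of the VMSS naming:

kubectl get nodes --show-labels
# node name is illustrative; substitute a real pool2 instance
kubectl describe node k8s-pool2-24760778-vmss000000 | grep -i taints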

Pod config file:

apiVersion: v1
kind: Pod
metadata:
  name: my-test-pod
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: vmSize
            operator: In
            values:
              - Standard_D4s_v3
      preferredDuringSchedulingIgnoredDuringExecution:
        - weight: 1
          preference:
            matchExpressions:
              - key: hardware
                operator: In
                values:
                - cpu-spec
  nodeSelector:
    agentpool: pool2
    hardware: cpu-spec
    vmSize: Standard_D4s_v3
    environment: DEV
    node-template: k8s-pool2-24760778-vmss
  tolerations:
    - key: dedicated
      operator: Equal
      value: pool2
      effect: NoSchedule
  containers:
    - name: my-test-pod
      image: ubuntu:latest
      command: ["/bin/bash", "-ec", "while :; do echo '.'; sleep 5; done"]
  restartPolicy: Never

I have tried variations of the above, adding and removing pieces of the nodeAffinity / nodeSelector / tolerations, all with the same result.

I did add pool2 to the autoscaler after the cluster was brought up. While searching the Internet for a solution I keep coming across articles about node-template labels, which I believe take the form k8s.io/cluster-autoscaler/node-template/label/<label-name>, but that only seems to be required for AWS.
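
For comparison, on AWS those node-template hints are applied as tags on the auto-scaling group rather than in the pod spec; a sketch of what they would look like there, with values matching my labels and taints (AWS only, shown for reference):

k8s.io/cluster-autoscaler/node-template/label/hardware: cpu-spec
k8s.io/cluster-autoscaler/node-template/label/vmSize: Standard_D4s_v3
k8s.io/cluster-autoscaler/node-template/taint/dedicated: pool2:NoSchedule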

Can anyone offer any guidance on getting this working on Azure?

Thanks.

Update:

I have found the answer. By removing the requiredDuringSchedulingIgnoredDuringExecution node affinity rule and using only preferredDuringSchedulingIgnoredDuringExecution, the scheduler correctly spins up a new VM in the scale set.

apiVersion: v1
kind: Pod
metadata:
  name: my-test-pod
spec:
  affinity:
    nodeAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
        - weight: 1
          preference:
            matchExpressions:
              - key: hardware
                operator: In
                values:
                - cpu-spec
  nodeSelector:
    agentpool: pool2
    hardware: cpu-spec
    vmSize: Standard_D4s_v3
    environment: DEV
    node-template: k8s-pool2-24760778-vmss
  tolerations:
    - key: dedicated
      operator: Equal
      value: pool2
      effect: NoSchedule
  containers:
    - name: my-test-pod
      image: ubuntu:latest
      command: ["/bin/bash", "-ec", "while :; do echo '.'; sleep 5; done"]
  restartPolicy: Never
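
A quick way to confirm the fix, using the pod spec and autoscaler pod name from above (the manifest file name is illustrative):

kubectl apply -f my-test-pod.yaml
kubectl logs -n kube-system cluster-autoscaler-5967b96496-jnvjr --tail=20
kubectl get nodes -w   # a new k8s-pool2-...-vmss node should appear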
