Autopilot GKE cluster insufficient cpu/mem

I am trying to deploy Pods on a GKE Autopilot cluster, but scheduling fails with the "Insufficient cpu / Insufficient memory" errors below. `kubectl get nodes` returns 3 nodes, each with only about 0.5 CPU and a similarly small amount of memory available. I am trying to run a GPU-heavy workload, so I expected GKE Autopilot to scale up automatically, but instead it keeps reporting insufficient resources. What am I doing wrong?
```
Warning FailedScheduling 27m (x5 over 31m) gke.io/optimize-utilization-scheduler 0/2 nodes are available: 2 Insufficient cpu, 2 Insufficient memory.
Warning FailedScheduling 26m gke.io/optimize-utilization-scheduler 0/3 nodes are available: 1 node(s) had taint {node.kubernetes.io/not-ready: }, that the pod didn't tolerate, 2 Insufficient cpu, 2 Insufficient memory.
Normal TriggeredScaleUp 26m cluster-autoscaler pod triggered scale-up: [{https://www.googleapis.com/compute/v1/projects/picdmo-342711/zones/us-central1-c/instanceGroups/gk3-picdmo-nap-1wcisjk4-2ba03e97-grp 0->1 (max: 1000)}]
Normal NotTriggerScaleUp 25m (x6 over 30m) cluster-autoscaler pod didn't trigger scale-up (it wouldn't fit if a new node is added): 1 Insufficient cpu, 1 Insufficient memory, 2 in backoff after failed scale-up
Normal NotTriggerScaleUp 20m (x14 over 21m) cluster-autoscaler pod didn't trigger scale-up (it wouldn't fit if a new node is added): 2 Insufficient cpu, 1 Insufficient memory
Normal TriggeredScaleUp 15m cluster-autoscaler pod triggered scale-up: [{https://www.googleapis.com/compute/v1/projects/picdmo-342711/zones/us-central1-c/instanceGroups/gk3-picdmo-nap-xt7d8ijc-37c84d94-grp 0->1 (max: 1000)}]
Warning FailedScaleUp 15m (x5 over 31m) cluster-autoscaler Node scale up in zones us-central1-c associated with this pod failed: GCE quota exceeded. Pod is at risk of not being scheduled.
Warning FailedScheduling 15m (x6 over 20m) gke.io/optimize-utilization-scheduler 0/4 nodes are available: 3 Insufficient memory, 4 Insufficient cpu.
Normal NotTriggerScaleUp 14m (x2 over 15m) cluster-autoscaler (combined from similar events): pod didn't trigger scale-up (it wouldn't fit if a new node is added): 2 in backoff after failed scale-up, 2 Insufficient cpu, 1 Insufficient memory
Warning FailedScheduling 13m (x2 over 14m) gke.io/optimize-utilization-scheduler 0/4 nodes are available: 1 node(s) had taint {ToBeDeletedByClusterAutoscaler: 1660665555}, that the pod didn't tolerate, 3 Insufficient cpu, 3 Insufficient memory.
Normal NotTriggerScaleUp 4m50s (x135 over 29m) cluster-autoscaler pod didn't trigger scale-up (it wouldn't fit if a new node is added): 1 Insufficient cpu, 1 Insufficient memory
Warning FailedScheduling 92s (x17 over 25m) gke.io/optimize-utilization-scheduler 0/3 nodes are available: 3 Insufficient cpu, 3 Insufficient memory.
```
```yaml
apiVersion: batch/v1
kind: Job
metadata:
  generateName: asd-job-
spec:
  template:
    spec:
      containers:
      - name: asd
        image: gcr.io/-342711/-job:latest
        imagePullPolicy: Always
        command: ["/bin/sh"]
        args: ["-c", "echo"]
        resources:
          requests:
            memory: "16000Mi"
            cpu: "8000m"
          limits:
            memory: "32000Mi"
            cpu: "16000m"
            nvidia.com/gpu: 2
      restartPolicy: Never
  backoffLimit: 4
```
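As a side note, on GKE Autopilot a GPU workload normally also needs a `cloud.google.com/gke-accelerator` node selector so that Autopilot knows which accelerator-backed node class to provision; without it, node auto-provisioning may never create a node the Pod fits on. A minimal sketch of the Pod template, assuming an `nvidia-tesla-t4` accelerator (the accelerator type and the `asd` names are placeholders, not taken from the original manifest):

```yaml
# Hypothetical Autopilot GPU Job template (sketch, not the asker's exact spec)
apiVersion: batch/v1
kind: Job
metadata:
  generateName: asd-job-
spec:
  template:
    spec:
      # Tell Autopilot which GPU-backed node class to provision
      nodeSelector:
        cloud.google.com/gke-accelerator: nvidia-tesla-t4
      containers:
      - name: asd
        image: gcr.io/-342711/-job:latest
        resources:
          limits:
            # For extended resources like GPUs, requests default to limits
            nvidia.com/gpu: 2
      restartPolicy: Never
```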
Look at line 7 of your events:

```
us-central1-c associated with this pod failed: GCE quota exceeded
```

This is most likely a quota problem. Check IAM & Admin > Quotas in the Google Cloud console and request an increase for the resource that is exhausted (CPUs, GPUs, or both) in `us-central1`.
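The same quota information can be inspected from the command line. A sketch using `gcloud` (the region matches the `us-central1-c` zone in your events; the `--flatten`/`--format` combination is a common idiom, but verify the field names against your gcloud version):

```shell
# List per-region quotas with current usage and limits for us-central1;
# look for CPUS, NVIDIA_*_GPUS, or IN_USE_ADDRESSES hitting their limit.
gcloud compute regions describe us-central1 \
  --flatten="quotas[]" \
  --format="table(quotas.metric,quotas.usage,quotas.limit)"
```

If a GPU or CPU quota shows usage equal to its limit, that explains why the cluster autoscaler's scale-up attempts fail and go into backoff, even though Autopilot would otherwise add nodes.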