Unschedulable GPU workload on GKE from node pool

Question

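I am running GPU-intensive workloads on demand on GKE Standard, for which I created an appropriate node pool with a minimum of 0 and a maximum of 5 nodes. However, when a Job is scheduled on the node pool, GKE reports the following error: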

Events:
  Type     Reason             Age                From                Message
  ----     ------             ----               ----                -------
  Warning  FailedScheduling   59s (x2 over 60s)  default-scheduler   0/1 nodes are available: 1 node(s) didn't match Pod's node affinity/selector.
  Normal   NotTriggerScaleUp  58s                cluster-autoscaler  pod didn't trigger scale-up: 1 node(s) had taint {nvidia.com/gpu: present}, that the pod didn't tolerate, 1 in backoff after failed scale-up

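I have set the nodeSelector according to the documentation and enabled autoscaling. I can confirm that it does find the node pool, despite the "didn't match Pod's node affinity/selector" error, and that it tries to scale up the cluster. Shortly afterwards, however, it fails, saying 0/1 nodes are available. That looks plainly wrong, given that 0/5 nodes in the node pool are in use. What am I doing wrong here?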

Answer 1

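For the node(s) didn't match Pod's node affinity/selector message: you did not share the details of your manifest file, but suppose it contains the following lines: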

nodeSelector:
  nodePool: cluster

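One option is to remove these lines from the YAML file. Another is to add nodePool: cluster as a label to all nodes, so that the pod can be scheduled using the existing selector. The following command may be useful: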

kubectl label nodes <your node name> nodePool=cluster
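With a node pool that scales from 0, a label added by hand with kubectl label only exists on nodes that are already running, so it is likely not visible to the autoscaler when it decides whether a new node would fit the pod. A minimal sketch of an alternative, assuming you want to target the pool by the cloud.google.com/gke-nodepool label that GKE sets on every node (the pool name below is a placeholder, not taken from the question):

```yaml
# Fragment of the Job's pod spec (spec.template.spec).
nodeSelector:
  # Label applied by GKE itself to each node in a pool;
  # replace the placeholder with your own node pool name.
  cloud.google.com/gke-nodepool: <your node pool name>
```

Because this label comes from the node pool rather than from a manual kubectl label call, nodes created later by scale-up should carry it as well.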

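As for the 1 node(s) had taint {nvidia.com/gpu: present}, that the pod didn't tolerate message, you can do what @gohm'c suggested, or you can use the following to remove the taint from the master node, after which you should be able to schedule your pod on that node: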

kubectl taint nodes  <your node name> node-role.kubernetes.io/master-

You can use the following threads as references; they contain information from real cases: "Error: FailedScheduling: nodes didn't match node selector" and "Node had taints that the pod didn't tolerate error".

Answer 2

1 node(s) had taint {nvidia.com/gpu: present}, that the pod didn't tolerate...

Try adding tolerations to your Job's pod spec:

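...
spec:
  containers:
  - name: ...
    ...
  tolerations:
  # Matches the taint nvidia.com/gpu=present reported in the event.
  # Note: operator "Exists" must not be combined with a value,
  # so "Equal" is used here.
  - key: nvidia.com/gpu
    operator: Equal
    value: present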
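To show where the toleration sits in a full manifest, here is a minimal sketch of a Job, assuming one GPU per pod. The Job name, container name, and image are placeholders and not taken from the question; the NoSchedule effect is an assumption based on how GKE normally taints GPU nodes.

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: gpu-job                    # hypothetical name
spec:
  template:
    spec:
      restartPolicy: Never         # a Job requires Never or OnFailure
      containers:
      - name: gpu-container        # hypothetical name
        image: <your GPU image>    # placeholder
        resources:
          limits:
            nvidia.com/gpu: 1      # request one GPU
      tolerations:
      # Tolerate the taint shown in the NotTriggerScaleUp event.
      - key: nvidia.com/gpu
        operator: Equal
        value: present
        effect: NoSchedule         # assumed effect of the GKE GPU taint
```

On GKE, pods that request nvidia.com/gpu in their resource limits usually get this toleration added automatically, so if the event still appears it may be worth checking that the GPU limit is actually set on the container.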
