
How to properly label and configure Kubernetes to use Nvidia GPUs?

I have an in-house K8s cluster running on bare metal. One of my worker nodes has 4 GPUs, and I want to configure K8s to recognise and use them. Following the official documentation, I installed everything required, and now when I run the command below I get this output:

docker run --runtime=nvidia --rm nvidia/cuda nvidia-smi


Tue Nov 12 09:20:20 2019       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.67       Driver Version: 418.67       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce RTX 208...  On   | 00000000:02:00.0 Off |                  N/A |
| 29%   25C    P8     2W / 250W |      0MiB / 10989MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce RTX 208...  On   | 00000000:03:00.0 Off |                  N/A |
| 29%   25C    P8     1W / 250W |      0MiB / 10989MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  GeForce RTX 208...  On   | 00000000:82:00.0 Off |                  N/A |
| 29%   26C    P8     2W / 250W |      0MiB / 10989MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  GeForce RTX 208...  On   | 00000000:83:00.0 Off |                  N/A |
| 29%   26C    P8    12W / 250W |      0MiB / 10989MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

I know that I have to label the node so that K8s recognises these GPUs, but I can't find the correct labels in the official documentation. In the docs I only see this:

# Label your nodes with the accelerator type they have.
kubectl label nodes <node-with-k80> accelerator=nvidia-tesla-k80

In another tutorial (written only for Google Cloud) I found this:

aliyun.accelerator/nvidia_count=1                          #This field is important.
aliyun.accelerator/nvidia_mem=12209MiB
aliyun.accelerator/nvidia_name=Tesla-M40

So what is the proper way to label my node? Do I also need to label it with the number of GPUs and their memory size?

I see you are trying to make sure that your pod gets scheduled on a node with GPUs.

The easiest way to do that is to label the GPU node like this:

kubectl label node <node_name> has_gpu=true

and then, when creating your pod, add a nodeSelector field with has_gpu: "true" (quoted, because label values are strings). This way the pod will be scheduled only on nodes with GPUs. Read more in the k8s docs.
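As a minimal sketch of the nodeSelector approach, the shell function below prints a Pod manifest that is pinned to nodes labelled has_gpu=true. The function name gpu_pod_manifest, the pod name gpu-demo and the image nvidia/cuda are placeholders chosen for illustration, not something the answer prescribes.

```shell
#!/bin/sh
# Hypothetical helper (not part of kubectl): print a Pod manifest that can
# only be scheduled on nodes carrying the label has_gpu=true.
# $1 = pod name, $2 = container image (both placeholders).
gpu_pod_manifest() {
  cat <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: $1
spec:
  nodeSelector:
    has_gpu: "true"   # quoted: label values are strings, not booleans
  containers:
  - name: $1-ctr
    image: $2
EOF
}

# Print the manifest for inspection.
gpu_pod_manifest gpu-demo nvidia/cuda

# To create the pod on a real cluster, pipe it to kubectl instead:
#   gpu_pod_manifest gpu-demo nvidia/cuda | kubectl apply -f -
```

Printing the manifest first makes it easy to review before anything is sent to the cluster.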

The only problem is that in this case the scheduler is not aware of how many GPUs are on the node, so it can schedule more than 4 pods on a node with only 4 GPUs.

A better option would be to use a node extended resource.

It would look as follows:

  1. run kubectl proxy
  2. patch the node resource configuration:

     curl --header "Content-Type: application/json-patch+json" \
       --request PATCH \
       --data '[{"op": "add", "path": "/status/capacity/example.com~1gpu", "value": "4"}]' \
       http://localhost:8001/api/v1/nodes/<your-node-name>/status

  3. assign the extended resource to a pod:

     apiVersion: v1
     kind: Pod
     metadata:
       name: extended-resource-demo
     spec:
       containers:
       - name: extended-resource-demo-ctr
         image: my_pod_name
         resources:
           requests:
             example.com/gpu: 1
           limits:
             example.com/gpu: 1
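The ~1 in the patch path is the JSON Pointer escape for "/", so example.com~1gpu refers to the resource example.com/gpu. The sketch below shows one way to derive that path and the patch body from a resource name; the helper names json_pointer_escape and capacity_patch are made up for this illustration.

```shell
#!/bin/sh
# Escape a resource name for use as one JSON Pointer path segment:
# "~" must become "~0" first, then "/" becomes "~1".
json_pointer_escape() {
  printf '%s' "$1" | sed -e 's/~/~0/g' -e 's#/#~1#g'
}

# Hypothetical helper: build the JSON Patch body that advertises
# $2 units of the extended resource $1 in the node's capacity.
capacity_patch() {
  printf '[{"op": "add", "path": "/status/capacity/%s", "value": "%s"}]' \
    "$(json_pointer_escape "$1")" "$2"
}

# Print the patch body for 4 units of example.com/gpu.
capacity_patch example.com/gpu 4
echo
```

The printed body is the same string that step 2 passes to curl via --data.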

In this case the scheduler knows how many GPUs are available on the node and won't schedule more pods there if it cannot satisfy their requests.
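To make the arithmetic concrete, here is a toy model of that accounting. It is not the real scheduler, only an illustration: schedulable_pods is a hypothetical function that counts how many pods fit when each requests a fixed number of units from a node's advertised capacity.

```shell
#!/bin/sh
# Toy model only (not the real scheduler): print how many of $2 pods,
# each requesting $3 units, fit on a node advertising $1 units.
schedulable_pods() {
  capacity=$1; pods=$2; request=$3
  fit=$((capacity / request))
  if [ "$fit" -gt "$pods" ]; then fit=$pods; fi
  echo "$fit"
}

# 4 GPUs advertised, 6 pods each requesting 1 GPU.
scheduled=$(schedulable_pods 4 6 1)
echo "scheduled: $scheduled, pending: $((6 - scheduled))"
# prints "scheduled: 4, pending: 2"
```

On a real cluster the pods that do not fit would stay in the Pending state until capacity is freed.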

