

How to enable GPU passthrough on CentOS/RHEL/OL8 using snapd's LXD/LXC containers?

The guide I have for deploying LXC on CentOS is to install snapd's LXD: https://www.cyberciti.biz/faq/set-up-use-lxd-on-centos-rhel-8-x/

snapd is a service that allows installing snap packages (as commonly used on Debian/Ubuntu), the logic being that the LXD snap is the most up-to-date version available on this platform.
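For reference, the snap-based install on an EL8 host boils down to something like the following (a sketch based on the linked guide; exact package names may vary by release):

sudo dnf install -y epel-release snapd
sudo systemctl enable --now snapd.socket
sudo ln -s /var/lib/snapd/snap /snap   # enable "classic" snap support
sudo snap install lxd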

Well, I'm open to installing an alternative version if that makes enabling GPU passthrough easier.

Ultimately I'm trying to build a container environment where I can run the latest versions of Python and Jupyter with GPU support.

I have some guides on how to enable GPU passthrough:

https://theorangeone.net/posts/lxc-nvidia-gpu-passthrough/
https://www.reddit.com/r/Proxmox/comments/glog5j/lxc_gpu_passthrough/

I've added the following kernel modules on my OL8 host:

/etc/modules-load.d/vfio-pci.conf
    # Nvidia modules
    nvidia
    nvidia_uvm

# noticed snapd has a modules file I can't edit:

/var/lib/snapd/snap/core18/1988/etc/modules-load.d/modules.conf
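To sanity-check this step, the host can be queried to confirm the modules actually loaded (my own aside, not from the guides above):

lsmod | grep nvidia                # expect nvidia and nvidia_uvm listed
sudo modprobe nvidia               # load manually if they are missing
sudo modprobe nvidia_uvm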
            

Then modified GRUB:

nano /etc/default/grub
    # https://access.redhat.com/documentation/en-us/red_hat_virtualization/4.1/html/installation_guide/appe-configuring_a_hypervisor_host_for_pci_passthrough
    # appended to GRUB_CMDLINE_LINUX (previously tried: iommu=on amd_iommu=on)
    iommu=pt amd_iommu=pt
            
grub2-mkconfig -o /boot/grub2/grub.cfg
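After rebooting, the effective kernel command line and IOMMU state can be verified like so (a hedged aside; the exact dmesg wording varies by kernel):

cat /proc/cmdline                  # should now include iommu=pt amd_iommu=pt
dmesg | grep -i -e iommu -e dmar   # look for IOMMU/DMAR initialization lines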

Then added udev rules:

    nano /etc/udev/rules.d/70-nvidia.rules
    KERNEL=="nvidia", RUN+="/bin/bash -c '/usr/bin/nvidia-smi -L && /bin/chmod 666 /dev/nvidia*'"
    KERNEL=="nvidia_uvm", RUN+="/bin/bash -c '/usr/bin/nvidia-modprobe -c0 -u && /bin/chmod 0666 /dev/nvidia-uvm*'"

#reboot
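(As an aside, new udev rules can also be applied without a full reboot, assuming systemd-udevd:)

sudo udevadm control --reload-rules
sudo udevadm trigger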

Then added the GPU to the container's lxc.conf:

ls -l /dev/nvidia*

nano /var/snap/lxd/common/lxd/logs/nvidia-test/lxc.conf

# Allow cgroup access
lxc.cgroup.devices.allow: c 195:* rwm
lxc.cgroup.devices.allow: c 243:* rwm

# Pass through device files
lxc.mount.entry: /dev/nvidia0 dev/nvidia0 none bind,optional,create=file
lxc.mount.entry: /dev/nvidiactl dev/nvidiactl none bind,optional,create=file
lxc.mount.entry: /dev/nvidia-uvm dev/nvidia-uvm none bind,optional,create=file
lxc.mount.entry: /dev/nvidia-modeset dev/nvidia-modeset none bind,optional,create=file
lxc.mount.entry: /dev/nvidia-uvm-tools dev/nvidia-uvm-tools none bind,optional,create=file
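Worth noting: with the snap package, this lxc.conf under /var/snap/lxd/common/lxd/logs/ is regenerated by LXD every time the container starts, so hand edits there are overwritten. A rough LXD-native equivalent (a sketch; device names are illustrative) would pass the nodes through as unix-char devices instead:

lxc config device add nvidia-test nvidia0 unix-char path=/dev/nvidia0
lxc config device add nvidia-test nvidiactl unix-char path=/dev/nvidiactl
lxc config device add nvidia-test nvidia-uvm unix-char path=/dev/nvidia-uvm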

Inside the LXC container I started (OL8):

#installed nvidia-driver that comes with nvidia-smi
    nvidia-driver-cuda-3:460.32.03-1.el8.x86_64
    
#installed cuda
    cuda-11-2-11.2.2-1.x86_64

When I go to run nvidia-smi:

[root@nvidia-test ~]# nvidia-smi
NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.

Because I couldn't edit the snapd modules file, I thought to manually copy the NVIDIA kernel module files over and insmod them inside the container (dependencies determined using modprobe --show-depends):

[root@nvidia-test ~]# insmod nvidia.ko.xz NVreg_DynamicPowerManagement=0x02
insmod: ERROR: could not insert module nvidia.ko.xz: Function not implemented

Some diagnostic information from inside my container:

[root@nvidia-test ~]# find /sys | grep dmar
find: '/sys/kernel/debug': Permission denied
find: '/sys/fs/pstore': Permission denied
find: '/sys/fs/fuse/connections/59': Permission denied
[root@nvidia-test ~]# lspci | grep -i nvidia
05:00.0 VGA compatible controller: NVIDIA Corporation GP107GL [Quadro P1000] (rev a1)
05:00.1 Audio device: NVIDIA Corporation GP107GL High Definition Audio Controller (rev a1)

So... is there something else I should do? Should I remove snapd's LXD and go with the default LXC provided by OL8?

You can use GPU passthrough to a LXD container by creating a LXD gpu device. This gpu device collectively performs all the tasks necessary to expose the GPU to the container, including the configuration you made explicitly above.

Here is the documentation with all the extra parameters (for example, how to distinguish between GPUs if there is more than one): https://linuxcontainers.org/lxd/docs/master/instances#type-gpu

In the simplest form, you can run the following against an existing container to add the default GPU to it:

lxc config device add mycontainer mynvidia gpu
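If the host has more than one GPU, the gpu device accepts selectors such as vendorid, productid, pci or id (see the docs linked above). For example, to pick the card by the PCI address shown in the question's lspci output (address illustrative):

lxc config device add mycontainer mynvidia gpu pci=0000:05:00.0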

When you add a GPU to an NVidia container, you also need to add the corresponding NVidia runtime to the container (so that it matches the kernel driver version on the host). In containers we do not need to (and cannot) load kernel drivers, but we do need the matching runtime (libraries, utilities and other software). LXD takes care of this and downloads for you the appropriate version of the NVidia container runtime and attaches it to the container. Here is a full example that creates a container with the NVidia runtime enabled, and then adds the NVidia GPU device to that container.

$ lxc launch ubuntu: mycontainer -c nvidia.runtime=true -c nvidia.driver.capabilities=all
Creating mycontainer
Starting mycontainer
$ lxc config device add mycontainer mynvidia gpu
Device mynvidia added to mycontainer
$ lxc shell mycontainer
root@mycontainer:~# nvidia-smi 
Mon Mar 15 13:37:24 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.102.04   Driver Version: 450.102.04   CUDA Version: 11.0     |
|-------------------------------+----------------------+----------------------+
...
$ 

If you often create such GPU containers, you can create a LXD profile with the GPU configuration. Then, when you want a GPU container, you can either launch the container with the nvidia profile, or apply the nvidia profile to an existing container and thus turn it into a GPU container!

$ cat mynvidiaLXDprofile.txt
config:
  nvidia.driver.capabilities: all
  nvidia.runtime: "true"
description: ""
devices:
  mygpu:
    type: gpu
name: nvidia
used_by: []
$ lxc profile create nvidia
Profile nvidia created
$ lxc profile edit nvidia < mynvidiaLXDprofile.txt
$ lxc launch ubuntu:20.04 mycontainer --profile default --profile nvidia
Creating mycontainer
Starting mycontainer
$ 
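To turn an already-existing container into a GPU container, attach the same profile to it (container name here is hypothetical); note that nvidia.runtime takes effect at container start, so a restart may be needed:

$ lxc profile add mycontainer2 nvidia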

We have been using the snap package of LXD for all of the above instructions.
