
How to get Docker to recognize NVIDIA drivers?

I have a container that loads a PyTorch model. Every time I try to start it up, I get this error:

Traceback (most recent call last):
  File "server/start.py", line 166, in <module>
    start()
  File "server/start.py", line 94, in start
    app.register_blueprint(create_api(), url_prefix="/api/1")
  File "/usr/local/src/skiff/app/server/server/api.py", line 30, in create_api
    atomic_demo_model = DemoModel(model_filepath, comet_dir)
  File "/usr/local/src/comet/comet/comet/interactive/atomic_demo.py", line 69, in __init__
    model = interactive.make_model(opt, n_vocab, n_ctx, state_dict)
  File "/usr/local/src/comet/comet/comet/interactive/functions.py", line 98, in make_model
    model.to(cfg.device)
  File "/usr/local/lib/python3.7/site-packages/torch/nn/modules/module.py", line 381, in to
    return self._apply(convert)
  File "/usr/local/lib/python3.7/site-packages/torch/nn/modules/module.py", line 187, in _apply
    module._apply(fn)
  File "/usr/local/lib/python3.7/site-packages/torch/nn/modules/module.py", line 187, in _apply
    module._apply(fn)
  File "/usr/local/lib/python3.7/site-packages/torch/nn/modules/module.py", line 193, in _apply
    param.data = fn(param.data)
  File "/usr/local/lib/python3.7/site-packages/torch/nn/modules/module.py", line 379, in convert
    return t.to(device, dtype if t.is_floating_point() else None, non_blocking)
  File "/usr/local/lib/python3.7/site-packages/torch/cuda/__init__.py", line 161, in _lazy_init
    _check_driver()
  File "/usr/local/lib/python3.7/site-packages/torch/cuda/__init__.py", line 82, in _check_driver
    http://www.nvidia.com/Download/index.aspx""")
AssertionError:
Found no NVIDIA driver on your system. Please check that you
have an NVIDIA GPU and installed a driver from
http://www.nvidia.com/Download/index.aspx

I know that nvidia-docker2 is working.

$ docker run --runtime=nvidia --rm nvidia/cuda:9.0-base nvidia-smi
Tue Jul 16 22:09:40 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.39       Driver Version: 418.39       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce RTX 208...  Off  | 00000000:1A:00.0 Off |                  N/A |
|  0%   44C    P0    72W / 260W |      0MiB / 10989MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce RTX 208...  Off  | 00000000:1B:00.0 Off |                  N/A |
|  0%   44C    P0    66W / 260W |      0MiB / 10989MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  GeForce RTX 208...  Off  | 00000000:1E:00.0 Off |                  N/A |
|  0%   44C    P0    48W / 260W |      0MiB / 10989MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  GeForce RTX 208...  Off  | 00000000:3E:00.0 Off |                  N/A |
|  0%   41C    P0    54W / 260W |      0MiB / 10989MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   4  GeForce RTX 208...  Off  | 00000000:3F:00.0 Off |                  N/A |
|  0%   42C    P0    48W / 260W |      0MiB / 10989MiB |      1%      Default |
+-------------------------------+----------------------+----------------------+
|   5  GeForce RTX 208...  Off  | 00000000:41:00.0 Off |                  N/A |
|  0%   42C    P0     1W / 260W |      0MiB / 10989MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

However, I keep getting the error above.

I've tried the following:

  1. Setting "default-runtime": "nvidia" in /etc/docker/daemon.json (see the example daemon.json after this list)

  2. Using docker run --runtime=nvidia <IMAGE_ID>

  3. Adding the variables below to my Dockerfile:

ENV NVIDIA_VISIBLE_DEVICES all
ENV NVIDIA_DRIVER_CAPABILITIES compute,utility
LABEL com.nvidia.volumes.needed="nvidia_driver"
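
For reference, setting the default runtime in /etc/docker/daemon.json would look roughly like this (a sketch based on the standard nvidia-docker2 configuration; the value must be the quoted string "nvidia", and Docker needs a restart afterwards, e.g. sudo systemctl restart docker):

{
    "default-runtime": "nvidia",
    "runtimes": {
        "nvidia": {
            "path": "nvidia-container-runtime",
            "runtimeArgs": []
        }
    }
}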

I expect this container to run - we have a working version in production without these issues. And I know that Docker can find the drivers, as the output above shows. Any ideas?

I got the same error. After trying a number of solutions, I found that the command below worked:

docker run -ti --runtime=nvidia -e NVIDIA_DRIVER_CAPABILITIES=compute,utility -e NVIDIA_VISIBLE_DEVICES=all <image_name>
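
To double-check that the NVIDIA runtime exposes the driver inside your own image (not just the nvidia/cuda test image), you can run nvidia-smi through it as well; <image_name> is a placeholder, and the utility capability is what makes the runtime mount nvidia-smi into the container:

docker run --rm --runtime=nvidia -e NVIDIA_DRIVER_CAPABILITIES=compute,utility -e NVIDIA_VISIBLE_DEVICES=all <image_name> nvidia-smi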

For Docker to use the host's GPU drivers and GPUs, a few steps are necessary:

  • Make sure an NVIDIA driver is installed on the host system
  • Follow the steps here to set up the NVIDIA Container Toolkit
  • Make sure CUDA and cuDNN are installed in the image
  • Run the container with the --gpus flag (as explained in the link above)

I assume you have covered the first three points, since nvidia-docker2 is working. So the missing --gpus flag in your run command could be the issue.

I usually run my containers with the following command:

docker run --name <container_name> --gpus all -it <image_name>

-it just makes the container interactive and starts a bash environment.
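
Once the container runs with --gpus all, a quick way to confirm that PyTorch itself can see the GPU (assuming python and torch are installed in the image; <image_name> is a placeholder) is:

docker run --rm --gpus all <image_name> python -c "import torch; print(torch.cuda.is_available())"

This should print True; if it prints False, the container still cannot reach the driver.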

In my case, I was building from a vanilla Ubuntu base image, i.e.

FROM ubuntu

Changing to an NVIDIA-provided Docker base image solved the issue for me:

FROM nvidia/cuda:11.2.1-runtime-ubuntu20.04
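
A minimal sketch of such a Dockerfile, assuming a pip-installed PyTorch (the package choices and paths here are illustrative, not taken from the question):

FROM nvidia/cuda:11.2.1-runtime-ubuntu20.04

# The runtime image does not ship Python, so install it first
RUN apt-get update && DEBIAN_FRONTEND=noninteractive apt-get install -y python3 python3-pip && rm -rf /var/lib/apt/lists/*

# Install PyTorch (the pip wheel bundles its own CUDA libraries; only the host driver is needed at run time)
RUN pip3 install torch

# Copy the application code (illustrative path and entry point)
COPY . /app
WORKDIR /app
CMD ["python3", "server/start.py"]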

Just use "docker run --gpus all": add "--gpus all" or "--gpus 0" to your run command!

If you are running your solution on a GPU-powered AWS EC2 machine and are using an EKS-optimized accelerated AMI, as was the case with us, then you do not need to set the runtime to nvidia yourself, as that is already the default runtime of the accelerated AMIs. You can verify this by checking /etc/systemd/system/docker.service.d/nvidia-docker-dropin.conf:

  • ssh into the AWS machine
  • cat /etc/systemd/system/docker.service.d/nvidia-docker-dropin.conf

All that was required was to set these two environment variables, as suggested by Chirag in the answer above and here (NVIDIA Container Toolkit user guide); see the example command after this list:

  • -e NVIDIA_DRIVER_CAPABILITIES=compute,utility or -e NVIDIA_DRIVER_CAPABILITIES=all
  • -e NVIDIA_VISIBLE_DEVICES=all
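
Put together, the run command looked roughly like this (a sketch; <image_name> stands for your image, and no --runtime or --gpus flag is needed because nvidia is already the default runtime on these AMIs):

docker run -e NVIDIA_DRIVER_CAPABILITIES=compute,utility -e NVIDIA_VISIBLE_DEVICES=all <image_name>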

Before arriving at the final solution, I also tried setting the runtime in daemon.json. To begin with, the AMIs we were using did not have a daemon.json file; they contained a key.json file instead. I tried setting the runtime in both files, but restarting Docker always overwrote the changes in key.json or simply deleted the daemon.json file.
