fatal error: cuda_runtime_api.h: No such file or directory when trying to use cuda in docker

Question

I am trying to build a docker image for a python script that I would like to deploy. This is the first time I am using docker, so I'm probably doing something wrong, but I have no clue what.

My system:

OS: Ubuntu 20.04
docker version: 19.03.8

I am using this Dockerfile:

# Dockerfile
FROM nvidia/cuda:11.0-base

COPY . /SingleModelTest

WORKDIR /SingleModelTest

RUN nvidia-smi

# these are just to make sure pip and git are installed to install the requirements
RUN set -xe \
    && apt-get update \
    && apt-get install python3-pip -y \
    && apt-get install git -y 
RUN pip3 install --upgrade pip

RUN pip3 install -r requirements/requirements1.txt
RUN pip3 install -r requirements/requirements2.txt    #this is where it fails

ENTRYPOINT ["python"]

CMD ["TabNetAPI.py"]

The output from nvidia-smi is as expected:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.80.02    Driver Version: 450.80.02    CUDA Version: 11.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  GeForce GTX 1050    Off  | 00000000:01:00.0  On |                  N/A |
|  0%   54C    P0    N/A /  90W |   1983MiB /  1995MiB |     18%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+

So CUDA does work, but when I try to install the required packages from the requirements files, this happens:

     command: /usr/bin/python3 -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/SingleModelTest/src/mmdet/setup.py'"'"'; __file__='"'"'/SingleModelTest/src/mmdet/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' develop --no-deps
         cwd: /SingleModelTest/src/mmdet/
    Complete output (24 lines):
    running develop
    running egg_info
    creating mmdet.egg-info
    writing mmdet.egg-info/PKG-INFO
    writing dependency_links to mmdet.egg-info/dependency_links.txt
    writing requirements to mmdet.egg-info/requires.txt
    writing top-level names to mmdet.egg-info/top_level.txt
    writing manifest file 'mmdet.egg-info/SOURCES.txt'
    reading manifest file 'mmdet.egg-info/SOURCES.txt'
    writing manifest file 'mmdet.egg-info/SOURCES.txt'
    running build_ext
    building 'mmdet.ops.utils.compiling_info' extension
    creating build
    creating build/temp.linux-x86_64-3.8
    creating build/temp.linux-x86_64-3.8/mmdet
    creating build/temp.linux-x86_64-3.8/mmdet/ops
    creating build/temp.linux-x86_64-3.8/mmdet/ops/utils
    creating build/temp.linux-x86_64-3.8/mmdet/ops/utils/src
    x86_64-linux-gnu-gcc -pthread -Wno-unused-result -Wsign-compare -DNDEBUG -g -fwrapv -O2 -Wall -g -fstack-protector-strong -Wformat -Werror=format-security -g -fwrapv -O2 -g -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 -fPIC -DWITH_CUDA -I/usr/local/lib/python3.8/dist-packages/torch/include -I/usr/local/lib/python3.8/dist-packages/torch/include/torch/csrc/api/include -I/usr/local/lib/python3.8/dist-packages/torch/include/TH -I/usr/local/lib/python3.8/dist-packages/torch/include/THC -I/usr/local/cuda/include -I/usr/include/python3.8 -c mmdet/ops/utils/src/compiling_info.cpp -o build/temp.linux-x86_64-3.8/mmdet/ops/utils/src/compiling_info.o -DTORCH_API_INCLUDE_EXTENSION_H -DTORCH_EXTENSION_NAME=compiling_info -D_GLIBCXX_USE_CXX11_ABI=0 -std=c++11
    mmdet/ops/utils/src/compiling_info.cpp:3:10: fatal error: cuda_runtime_api.h: No such file or directory
        3 | #include <cuda_runtime_api.h>
          |          ^~~~~~~~~~~~~~~~~~~~
    compilation terminated.
    error: command 'x86_64-linux-gnu-gcc' failed with exit status 1
    ----------------------------------------
ERROR: Command errored out with exit status 1: /usr/bin/python3 -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/SingleModelTest/src/mmdet/setup.py'"'"'; __file__='"'"'/SingleModelTest/src/mmdet/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' develop --no-deps Check the logs for full command output.

The package that fails is mmdetection. I am using 2 separate requirements files to make sure some packages are installed before others, to prevent a dependency failure.

requirements1.txt:

torch==1.4.0+cu100 
-f https://download.pytorch.org/whl/torch_stable.html
torchvision==0.5.0+cu100 
-f https://download.pytorch.org/whl/torch_stable.html
numpy==1.19.2

requirements2.txt:

addict==2.3.0
albumentations==0.5.0
appdirs==1.4.4
asynctest==0.13.0
attrs==20.2.0
certifi==2020.6.20
chardet==3.0.4
cityscapesScripts==2.1.7
click==7.1.2
codecov==2.1.10
coloredlogs==14.0
coverage==5.3
cycler==0.10.0
Cython==0.29.21
decorator==4.4.2
flake8==3.8.4
Flask==1.1.2
humanfriendly==8.2
idna==2.10
imagecorruptions==1.1.0
imageio==2.9.0
imgaug==0.4.0
iniconfig==1.1.1
isort==5.6.4
itsdangerous==1.1.0
Jinja2==2.11.2
kiwisolver==1.2.0
kwarray==0.5.9
MarkupSafe==1.1.1
matplotlib==3.3.2
mccabe==0.6.1
mmcv==0.4.3
-e git+https://github.com/open-mmlab/mmdetection.git@0f33c08d8d46eba8165715a0995841a975badfd4#egg=mmdet
networkx==2.5
opencv-python==4.4.0.44
opencv-python-headless==4.4.0.44
ordered-set==4.0.2
packaging==20.4
pandas==1.1.3
Pillow==6.2.2
pluggy==0.13.1
py==1.9.0
pycocotools==2.0.2
pycodestyle==2.6.0
pyflakes==2.2.0
pyparsing==2.4.7
pyquaternion==0.9.9
pytesseract==0.3.6
pytest==6.1.1
pytest-cov==2.10.1
pytest-runner==5.2
python-dateutil==2.8.1
pytz==2020.1
PyWavelets==1.1.1
PyYAML==5.3.1
requests==2.24.0
scikit-image==0.17.2
scipy==1.5.3
Shapely==1.7.1
six==1.15.0
terminaltables==3.1.0
tifffile==2020.9.3
toml==0.10.1
tqdm==4.50.2
typing==3.7.4.3
ubelt==0.9.2
urllib3==1.25.11
Werkzeug==1.0.1
xdoctest==0.15.0
yapf==0.30.0

The command I use to (try to) build the image: nvidia-docker build -t firstdockertestsinglemodel:latest

Things I have tried:

Setting the CUDA environment variables such as CUDA_HOME, LIBRARY_PATH and LD_LIBRARY_PATH. I am not sure I did it correctly, though: I can't check the paths I set because I can't see them in the Ubuntu Files app.

I'll be very grateful for any help that anyone could offer. If I need to supply more information, I'll be happy to.

Answer 1

EDIT: this answer only tells you how to verify what's happening in your docker image. Unfortunately I'm unable to figure out why it is happening.

How to check it?

At each step of the docker build, you can see the ID of the layer that was just generated. You can use that ID to start a temporary container and check what's happening, e.g.

docker build -t my_bonk_example .
[...]
Removing intermediate container xxxxxxxxxxxxx
 ---> 57778e7c9788
Step 19/31 : RUN mkdir -p /tmp/spark-events
 ---> Running in afd21d853bcb
Removing intermediate container xxxxxxxxxxxxx
 ---> 33b26e1a2286 <-- let's use this ID
[ failure happens ]

docker run -it --rm --name bonk_container_before_failure 33b26e1a2286 bash
# now you're in the container

echo $LD_LIBRARY_PATH
ls /usr/local/cuda
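
For this particular error, the most direct check inside that container is whether the header the compiler asks for is present at all. A minimal sketch, assuming the toolkit lives under /usr/local/cuda (the path that appears in the -I flags of the failing gcc command); the helper name has_cuda_header is made up for illustration:

```shell
# Hypothetical helper: succeeds if cuda_runtime_api.h exists under the given CUDA root.
has_cuda_header() {
    cuda_root="${1:-/usr/local/cuda}"
    [ -f "$cuda_root/include/cuda_runtime_api.h" ]
}

if has_cuda_header /usr/local/cuda; then
    echo "CUDA headers found"
else
    echo "CUDA headers missing: this image cannot compile CUDA extensions"
fi
```

If the header is missing, no environment variable will help; the image itself does not contain the development files.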

Side notes about your Dockerfile:

You can improve the build time of future builds by changing the order of the instructions in your Dockerfile. Docker uses a cache that is invalidated as soon as it finds something that differs from the previous build. I'd expect you to change the code more often than the requirements of your docker image, so it makes sense to move the COPY after the apt instructions, e.g.

# Dockerfile
FROM nvidia/cuda:10.2-base

RUN set -xe \
    && apt-get update \
    && apt-get install python3-pip -y \
    && apt-get install git -y 

RUN pip3 install --upgrade pip

WORKDIR /SingleModelTest

COPY requirements /SingleModelTest/requirements

RUN pip3 install -r requirements/requirements1.txt
RUN pip3 install -r requirements/requirements2.txt

COPY . /SingleModelTest

RUN nvidia-smi

ENTRYPOINT ["python"]
CMD ["TabNetAPI.py"]

NOTE: this is just an example.

As for why the image doesn't build: I found that PyTorch 1.4 does not support CUDA 11.0 ( https://discuss.pytorch.org/t/pytorch-with-cuda-11-compatibility/89254 ), but using an earlier version of CUDA does not fix the issue either.
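
The +cu100 suffix on the torch and torchvision pins in requirements1.txt marks wheels built for CUDA 10.0, which is the version the base image would need to line up with. As a rough sketch, the version can be read off a pin like this, assuming the suffix is always "cu" followed by the major version and a single minor digit (cu92, cu100, cu101); the function name is made up for illustration:

```shell
# Hypothetical helper: print the CUDA version encoded in a pin such as "torch==1.4.0+cu100".
# Assumes the suffix is "cu" + major version + one minor digit (cu92 -> 9.2, cu100 -> 10.0).
cuda_version_from_pin() {
    case "$1" in
        *+cu[0-9][0-9]*) ;;
        *) return 1 ;;   # no CUDA suffix on this pin
    esac
    digits="${1##*+cu}"      # "100"
    major="${digits%?}"      # drop the last digit -> "10"
    minor="${digits#$major}" # what is left -> "0"
    printf '%s.%s\n' "$major" "$minor"
}

cuda_version_from_pin "torch==1.4.0+cu100"   # prints 10.0
```

Matching the version is only half of the picture: the image also has to ship the CUDA headers for mmdet to compile.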

Answer 2

Thanks to @Robert Crovella I solved my problem. It turned out I just needed to use nvidia/cuda:10.0-devel as the base image instead of nvidia/cuda:10.0-base. The -base images only contain a minimal CUDA runtime, while the -devel images also include the headers (such as cuda_runtime_api.h) and the compiler toolchain needed to build extensions.

So my Dockerfile is now:

# Dockerfile
FROM nvidia/cuda:10.0-devel

RUN nvidia-smi

RUN set -xe \
    && apt-get update \
    && apt-get install python3-pip -y \
    && apt-get install git -y 
RUN pip3 install --upgrade pip

WORKDIR /SingleModelTest

COPY requirements /SingleModelTest/requirements

RUN export LD_LIBRARY_PATH=/usr/local/cuda-10.0/lib64

RUN pip3 install -r requirements/requirements1.txt
RUN pip3 install -r requirements/requirements2.txt


COPY . /SingleModelTest

ENTRYPOINT ["python"]

CMD ["TabNetAPI.py"]
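
One caveat about the RUN export line: each RUN instruction is executed in its own shell, so a variable exported there is gone by the time the next instruction runs; ENV is the Dockerfile instruction that persists a variable across steps. The effect can be reproduced in plain shell, with a child shell standing in for a single RUN step (DEMO_VAR is a made-up name used only for this demonstration):

```shell
unset DEMO_VAR

# A child shell stands in for a single RUN step: the export ends with that shell.
sh -c 'export DEMO_VAR=/usr/local/cuda-10.0/lib64'
after_child="${DEMO_VAR:-unset}"

# Exporting in the current shell stands in for ENV: later commands inherit it.
export DEMO_VAR=/usr/local/cuda-10.0/lib64
after_export="$(sh -c 'echo "${DEMO_VAR:-unset}"')"

echo "after child shell: $after_child"
echo "after export:      $after_export"
```

Since the build reportedly works as posted, the variable is probably not needed for the pip install steps at all; if it is needed at runtime, an ENV LD_LIBRARY_PATH=... line would be the way to set it.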

Answer 1: 2, 2020-10-22 12:25:03
Answer 2: 2, accepted, 2020-10-22 14:06:14
