
fatal error: cuda_runtime_api.h: No such file or directory when trying to use cuda in docker

I am trying to build a Docker image for a Python script that I would like to deploy. This is the first time I am using Docker, so I'm probably doing something wrong, but I have no clue what.

My System:

OS: Ubuntu 20.04
docker version: 19.03.8

I am using this Dockerfile:

# Dockerfile
FROM nvidia/cuda:11.0-base

COPY . /SingleModelTest

WORKDIR /SingleModelTest

RUN nvidia-smi

# install pip and git so the requirements can be installed
RUN set -xe \
    && apt-get update \
    && apt-get install python3-pip -y \
    && apt-get install git -y 
RUN pip3 install --upgrade pip

RUN pip3 install -r requirements/requirements1.txt
RUN pip3 install -r requirements/requirements2.txt    #this is where it fails

ENTRYPOINT ["python"]

CMD ["TabNetAPI.py"]

The output from nvidia-smi is as expected:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.80.02    Driver Version: 450.80.02    CUDA Version: 11.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  GeForce GTX 1050    Off  | 00000000:01:00.0  On |                  N/A |
|  0%   54C    P0    N/A /  90W |   1983MiB /  1995MiB |     18%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+

So CUDA does work, but when I try to install the required packages from the requirements files, this happens:

     command: /usr/bin/python3 -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/SingleModelTest/src/mmdet/setup.py'"'"'; __file__='"'"'/SingleModelTest/src/mmdet/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' develop --no-deps
         cwd: /SingleModelTest/src/mmdet/
    Complete output (24 lines):
    running develop
    running egg_info
    creating mmdet.egg-info
    writing mmdet.egg-info/PKG-INFO
    writing dependency_links to mmdet.egg-info/dependency_links.txt
    writing requirements to mmdet.egg-info/requires.txt
    writing top-level names to mmdet.egg-info/top_level.txt
    writing manifest file 'mmdet.egg-info/SOURCES.txt'
    reading manifest file 'mmdet.egg-info/SOURCES.txt'
    writing manifest file 'mmdet.egg-info/SOURCES.txt'
    running build_ext
    building 'mmdet.ops.utils.compiling_info' extension
    creating build
    creating build/temp.linux-x86_64-3.8
    creating build/temp.linux-x86_64-3.8/mmdet
    creating build/temp.linux-x86_64-3.8/mmdet/ops
    creating build/temp.linux-x86_64-3.8/mmdet/ops/utils
    creating build/temp.linux-x86_64-3.8/mmdet/ops/utils/src
    x86_64-linux-gnu-gcc -pthread -Wno-unused-result -Wsign-compare -DNDEBUG -g -fwrapv -O2 -Wall -g -fstack-protector-strong -Wformat -Werror=format-security -g -fwrapv -O2 -g -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 -fPIC -DWITH_CUDA -I/usr/local/lib/python3.8/dist-packages/torch/include -I/usr/local/lib/python3.8/dist-packages/torch/include/torch/csrc/api/include -I/usr/local/lib/python3.8/dist-packages/torch/include/TH -I/usr/local/lib/python3.8/dist-packages/torch/include/THC -I/usr/local/cuda/include -I/usr/include/python3.8 -c mmdet/ops/utils/src/compiling_info.cpp -o build/temp.linux-x86_64-3.8/mmdet/ops/utils/src/compiling_info.o -DTORCH_API_INCLUDE_EXTENSION_H -DTORCH_EXTENSION_NAME=compiling_info -D_GLIBCXX_USE_CXX11_ABI=0 -std=c++11
    mmdet/ops/utils/src/compiling_info.cpp:3:10: fatal error: cuda_runtime_api.h: No such file or directory
        3 | #include <cuda_runtime_api.h>
          |          ^~~~~~~~~~~~~~~~~~~~
    compilation terminated.
    error: command 'x86_64-linux-gnu-gcc' failed with exit status 1
    ----------------------------------------
ERROR: Command errored out with exit status 1: /usr/bin/python3 -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/SingleModelTest/src/mmdet/setup.py'"'"'; __file__='"'"'/SingleModelTest/src/mmdet/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' develop --no-deps Check the logs for full command output.

The package that fails is mmdetection. I am using 2 separate requirements files to make sure some packages are installed before others, to prevent a dependency failure.

requirements1.txt:

torch==1.4.0+cu100 
-f https://download.pytorch.org/whl/torch_stable.html
torchvision==0.5.0+cu100 
-f https://download.pytorch.org/whl/torch_stable.html
numpy==1.19.2

requirements2.txt:

addict==2.3.0
albumentations==0.5.0
appdirs==1.4.4
asynctest==0.13.0
attrs==20.2.0
certifi==2020.6.20
chardet==3.0.4
cityscapesScripts==2.1.7
click==7.1.2
codecov==2.1.10
coloredlogs==14.0
coverage==5.3
cycler==0.10.0
Cython==0.29.21
decorator==4.4.2
flake8==3.8.4
Flask==1.1.2
humanfriendly==8.2
idna==2.10
imagecorruptions==1.1.0
imageio==2.9.0
imgaug==0.4.0
iniconfig==1.1.1
isort==5.6.4
itsdangerous==1.1.0
Jinja2==2.11.2
kiwisolver==1.2.0
kwarray==0.5.9
MarkupSafe==1.1.1
matplotlib==3.3.2
mccabe==0.6.1
mmcv==0.4.3
-e git+https://github.com/open-mmlab/mmdetection.git@0f33c08d8d46eba8165715a0995841a975badfd4#egg=mmdet
networkx==2.5
opencv-python==4.4.0.44
opencv-python-headless==4.4.0.44
ordered-set==4.0.2
packaging==20.4
pandas==1.1.3
Pillow==6.2.2
pluggy==0.13.1
py==1.9.0
pycocotools==2.0.2
pycodestyle==2.6.0
pyflakes==2.2.0
pyparsing==2.4.7
pyquaternion==0.9.9
pytesseract==0.3.6
pytest==6.1.1
pytest-cov==2.10.1
pytest-runner==5.2
python-dateutil==2.8.1
pytz==2020.1
PyWavelets==1.1.1
PyYAML==5.3.1
requests==2.24.0
scikit-image==0.17.2
scipy==1.5.3
Shapely==1.7.1
six==1.15.0
terminaltables==3.1.0
tifffile==2020.9.3
toml==0.10.1
tqdm==4.50.2
typing==3.7.4.3
ubelt==0.9.2
urllib3==1.25.11
Werkzeug==1.0.1
xdoctest==0.15.0
yapf==0.30.0

The command I use to (try to) build the image: nvidia-docker build -t firstdockertestsinglemodel:latest

Things I have tried:

  • setting the CUDA environment variables like CUDA_HOME, LIBRARY_PATH, and LD_LIBRARY_PATH, but I am not sure I did it correctly, since I can't check the paths I set because I can't see them in the Ubuntu Files app

I'll be very grateful for any help anyone can offer. If I need to supply more information, I'll be happy to.

EDIT: this answer just tells you how to verify what's happening in your Docker image. Unfortunately, I'm unable to figure out why it is happening.

How to check it?

At each step of the docker build, you can see the various layers being generated. You can use a layer's ID to start a temporary container from that intermediate image and check what's happening, e.g.:

docker build -t my_bonk_example .
[...]
Removing intermediate container xxxxxxxxxxxxx
 ---> 57778e7c9788
Step 19/31 : RUN mkdir -p /tmp/spark-events
 ---> Running in afd21d853bcb
Removing intermediate container xxxxxxxxxxxxx
 ---> 33b26e1a2286 <-- let's use this ID
[ failure happens ]

docker run -it --rm --name bonk_container_before_failure 33b26e1a2286 bash
# now you're in the container

echo $LD_LIBRARY_PATH
ls /usr/local/cuda

Side notes about your Dockerfile:

You can improve the build time of future builds if you change the order of the instructions in your Dockerfile. Docker uses a cache that is invalidated the moment it finds something different from the previous build. I'd expect you to change your code more often than the requirements of your Docker image, so it makes sense to move the COPY after the apt instructions, e.g.:

# Dockerfile
FROM nvidia/cuda:10.2-base

RUN set -xe \
    && apt-get update \
    && apt-get install python3-pip -y \
    && apt-get install git -y 

RUN pip3 install --upgrade pip

WORKDIR /SingleModelTest

COPY requirements /SingleModelTest/requirements

RUN pip3 install -r requirements/requirements1.txt
RUN pip3 install -r requirements/requirements2.txt

COPY . /SingleModelTest

RUN nvidia-smi

ENTRYPOINT ["python"]
CMD ["TabNetAPI.py"]

NOTE: this is just an example.


Concerning why the image doesn't build: I found that PyTorch 1.4 does not support CUDA 11.0 ( https://discuss.pytorch.org/t/pytorch-with-cuda-11-compatibility/89254 ), but using a previous version of CUDA also did not fix the issue.

Thanks to @Robert Crovella I solved my problem. It turned out I just needed to use nvidia/cuda:10.0-devel as the base image instead of nvidia/cuda:10.0-base: the -base images contain only the minimal CUDA runtime, while the -devel images also ship the headers (such as cuda_runtime_api.h) and toolchain needed to compile CUDA extensions.

So my Dockerfile is now:

# Dockerfile
FROM nvidia/cuda:10.0-devel

RUN nvidia-smi

RUN set -xe \
    && apt-get update \
    && apt-get install python3-pip -y \
    && apt-get install git -y 
RUN pip3 install --upgrade pip

WORKDIR /SingleModelTest

COPY requirements /SingleModelTest/requirements

# use ENV rather than `RUN export`: each RUN runs in its own shell,
# so an exported variable would not persist into later layers
ENV LD_LIBRARY_PATH=/usr/local/cuda-10.0/lib64

RUN pip3 install -r requirements/requirements1.txt
RUN pip3 install -r requirements/requirements2.txt


COPY . /SingleModelTest

ENTRYPOINT ["python"]

CMD ["TabNetAPI.py"]
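A side note on setting LD_LIBRARY_PATH in a Dockerfile: each RUN instruction executes in its own shell, so a variable set with `RUN export` is gone by the next instruction, and only ENV persists across layers. The effect can be reproduced without Docker (a minimal sketch using plain subshells; the variable name is illustrative):

```shell
# Each Dockerfile RUN starts a fresh shell, modeled here by separate `sh -c` calls.
sh -c 'export LIBPATH=/usr/local/cuda-10.0/lib64'   # set in one shell...
sh -c 'echo "LIBPATH=[$LIBPATH]"'                   # ...not visible in the next
# prints: LIBPATH=[]
```

This is why the ENV instruction exists: it bakes the variable into the image metadata, so every later RUN (and the final container) sees it.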
