fatal error: cuda_runtime_api.h: No such file or directory when trying to use CUDA in Docker
I am trying to build a Docker image for a Python script that I would like to deploy. This is the first time I am using Docker, so I am probably doing something wrong, but I have no clue what.
My system:
OS: Ubuntu 20.04
docker version: 19.03.8
I am using this Dockerfile:
# Dockerfile
FROM nvidia/cuda:11.0-base
COPY . /SingleModelTest
WORKDIR /SingleModelTest
RUN nvidia-smi
# these are just to make sure pip and git are installed to install the requirements
RUN set -xe \
&& apt-get update \
&& apt-get install python3-pip -y \
&& apt-get install git -y
RUN pip3 install --upgrade pip
RUN pip3 install -r requirements/requirements1.txt
RUN pip3 install -r requirements/requirements2.txt #this is where it fails
ENTRYPOINT ["python"]
CMD ["TabNetAPI.py"]
The output from nvidia-smi is as expected:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.80.02    Driver Version: 450.80.02    CUDA Version: 11.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  GeForce GTX 1050    Off  | 00000000:01:00.0  On |                  N/A |
|  0%   54C    P0    N/A /  90W |   1983MiB /  1995MiB |     18%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+
So CUDA does work, but when I try to install the required packages from the requirements files, this happens:
command: /usr/bin/python3 -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/SingleModelTest/src/mmdet/setup.py'"'"'; __file__='"'"'/SingleModelTest/src/mmdet/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' develop --no-deps
cwd: /SingleModelTest/src/mmdet/
Complete output (24 lines):
running develop
running egg_info
creating mmdet.egg-info
writing mmdet.egg-info/PKG-INFO
writing dependency_links to mmdet.egg-info/dependency_links.txt
writing requirements to mmdet.egg-info/requires.txt
writing top-level names to mmdet.egg-info/top_level.txt
writing manifest file 'mmdet.egg-info/SOURCES.txt'
reading manifest file 'mmdet.egg-info/SOURCES.txt'
writing manifest file 'mmdet.egg-info/SOURCES.txt'
running build_ext
building 'mmdet.ops.utils.compiling_info' extension
creating build
creating build/temp.linux-x86_64-3.8
creating build/temp.linux-x86_64-3.8/mmdet
creating build/temp.linux-x86_64-3.8/mmdet/ops
creating build/temp.linux-x86_64-3.8/mmdet/ops/utils
creating build/temp.linux-x86_64-3.8/mmdet/ops/utils/src
x86_64-linux-gnu-gcc -pthread -Wno-unused-result -Wsign-compare -DNDEBUG -g -fwrapv -O2 -Wall -g -fstack-protector-strong -Wformat -Werror=format-security -g -fwrapv -O2 -g -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 -fPIC -DWITH_CUDA -I/usr/local/lib/python3.8/dist-packages/torch/include -I/usr/local/lib/python3.8/dist-packages/torch/include/torch/csrc/api/include -I/usr/local/lib/python3.8/dist-packages/torch/include/TH -I/usr/local/lib/python3.8/dist-packages/torch/include/THC -I/usr/local/cuda/include -I/usr/include/python3.8 -c mmdet/ops/utils/src/compiling_info.cpp -o build/temp.linux-x86_64-3.8/mmdet/ops/utils/src/compiling_info.o -DTORCH_API_INCLUDE_EXTENSION_H -DTORCH_EXTENSION_NAME=compiling_info -D_GLIBCXX_USE_CXX11_ABI=0 -std=c++11
mmdet/ops/utils/src/compiling_info.cpp:3:10: fatal error: cuda_runtime_api.h: No such file or directory
3 | #include <cuda_runtime_api.h>
| ^~~~~~~~~~~~~~~~~~~~
compilation terminated.
error: command 'x86_64-linux-gnu-gcc' failed with exit status 1
----------------------------------------
ERROR: Command errored out with exit status 1: /usr/bin/python3 -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/SingleModelTest/src/mmdet/setup.py'"'"'; __file__='"'"'/SingleModelTest/src/mmdet/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' develop --no-deps Check the logs for full command output.
The package that fails is mmdetection. I am using two separate requirements files to make sure some packages are installed before others, to prevent a dependency failure.
requirements1.txt:
torch==1.4.0+cu100
-f https://download.pytorch.org/whl/torch_stable.html
torchvision==0.5.0+cu100
-f https://download.pytorch.org/whl/torch_stable.html
numpy==1.19.2
requirements2.txt:
addict==2.3.0
albumentations==0.5.0
appdirs==1.4.4
asynctest==0.13.0
attrs==20.2.0
certifi==2020.6.20
chardet==3.0.4
cityscapesScripts==2.1.7
click==7.1.2
codecov==2.1.10
coloredlogs==14.0
coverage==5.3
cycler==0.10.0
Cython==0.29.21
decorator==4.4.2
flake8==3.8.4
Flask==1.1.2
humanfriendly==8.2
idna==2.10
imagecorruptions==1.1.0
imageio==2.9.0
imgaug==0.4.0
iniconfig==1.1.1
isort==5.6.4
itsdangerous==1.1.0
Jinja2==2.11.2
kiwisolver==1.2.0
kwarray==0.5.9
MarkupSafe==1.1.1
matplotlib==3.3.2
mccabe==0.6.1
mmcv==0.4.3
-e git+https://github.com/open-mmlab/mmdetection.git@0f33c08d8d46eba8165715a0995841a975badfd4#egg=mmdet
networkx==2.5
opencv-python==4.4.0.44
opencv-python-headless==4.4.0.44
ordered-set==4.0.2
packaging==20.4
pandas==1.1.3
Pillow==6.2.2
pluggy==0.13.1
py==1.9.0
pycocotools==2.0.2
pycodestyle==2.6.0
pyflakes==2.2.0
pyparsing==2.4.7
pyquaternion==0.9.9
pytesseract==0.3.6
pytest==6.1.1
pytest-cov==2.10.1
pytest-runner==5.2
python-dateutil==2.8.1
pytz==2020.1
PyWavelets==1.1.1
PyYAML==5.3.1
requests==2.24.0
scikit-image==0.17.2
scipy==1.5.3
Shapely==1.7.1
six==1.15.0
terminaltables==3.1.0
tifffile==2020.9.3
toml==0.10.1
tqdm==4.50.2
typing==3.7.4.3
ubelt==0.9.2
urllib3==1.25.11
Werkzeug==1.0.1
xdoctest==0.15.0
yapf==0.30.0
The command I use to (try to) build the image:
nvidia-docker build -t firstdockertestsinglemodel:latest
Things I have tried:
I'll be very grateful for any help that anyone could offer. If I need to supply more information, I'll be happy to.
EDIT: this answer just tells you how to verify what's happening in your Docker image. Unfortunately, I'm unable to figure out why it is happening.
How to check it?
Each step of the docker build prints the ID of the layer it generates. You can start a temporary container from the last ID printed before the failure to check what's happening, e.g.:
docker build -t my_bonk_example .
[...]
Removing intermediate container xxxxxxxxxxxxx
---> 57778e7c9788
Step 19/31 : RUN mkdir -p /tmp/spark-events
---> Running in afd21d853bcb
Removing intermediate container xxxxxxxxxxxxx
---> 33b26e1a2286 <-- let's use this ID
[ failure happens ]
docker run -it --rm --name bonk_container_before_failure 33b26e1a2286 bash
# now you're in the container
echo $LD_LIBRARY_PATH
ls /usr/local/cuda
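If the build output was saved to a file, the layer ID can also be pulled out automatically. This is only a minimal sketch, assuming the classic (non-BuildKit) builder output format shown above; last_layer_id and build.log are hypothetical names, not part of Docker:

```shell
# Print the ID of the last layer that was built successfully,
# reading classic `docker build` output from stdin.
last_layer_id() {
    # Layer lines look like " ---> 33b26e1a2286". Lines such as
    # "---> Running in ..." name temporary containers, not layers,
    # so the pattern (12 hex digits right after the arrow) skips them.
    sed -n 's/^[[:space:]]*---> \([0-9a-f]\{12\}\).*$/\1/p' | tail -n 1
}

# Possible usage (build.log is an assumed file name):
#   docker build -t my_bonk_example . 2>&1 | tee build.log
#   docker run -it --rm "$(last_layer_id < build.log)" bash
```

The usage lines are left as comments so that sourcing the snippet does not start a build.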
Side notes about your Dockerfile:
You can improve the build time of future builds by changing the order of the instructions in your Dockerfile. Docker uses a build cache that is invalidated as soon as a step differs from the previous build, and every step after that point is rebuilt. I'd expect you to change the code more often than the requirements of your Docker image, so it makes sense to move the COPY after the apt instructions, e.g.:
# Dockerfile
FROM nvidia/cuda:10.2-base
RUN set -xe \
&& apt-get update \
&& apt-get install python3-pip -y \
&& apt-get install git -y
RUN pip3 install --upgrade pip
WORKDIR /SingleModelTest
COPY requirements /SingleModelTest/requirements
RUN pip3 install -r requirements/requirements1.txt
RUN pip3 install -r requirements/requirements2.txt
COPY . /SingleModelTest
RUN nvidia-smi
ENTRYPOINT ["python"]
CMD ["TabNetAPI.py"]
NOTE: this is just an example.
Concerning why the image doesn't build: I found that PyTorch 1.4 does not support CUDA 11.0 ( https://discuss.pytorch.org/t/pytorch-with-cuda-11-compatibility/89254 ), but using an earlier version of CUDA does not fix the issue either.
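The +cu100 suffix on the torch and torchvision pins in requirements1.txt marks wheels built against CUDA 10.0, so it can help to read the target version straight from the pin when choosing a base image tag. A rough sketch, assuming the usual "+cuXY" local-version convention where the last digit is the minor version; wheel_cuda_version is a hypothetical helper name:

```shell
# Print the CUDA version encoded in a pin such as "torch==1.4.0+cu100".
# Returns non-zero if the pin carries no "+cu" tag.
wheel_cuda_version() {
    # Keep only the digits that follow "+cu" (at least two are expected).
    tag=$(printf '%s\n' "$1" | sed -n 's/.*+cu\([0-9][0-9][0-9]*\).*/\1/p')
    [ -n "$tag" ] || return 1
    # The last digit is the minor version, everything before it the major.
    printf '%s.%s\n' "${tag%?}" "${tag#"${tag%?}"}"
}

wheel_cuda_version 'torch==1.4.0+cu100'
```

For the pin from requirements1.txt this prints 10.0, which matches a 10.0 image tag rather than 11.0.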
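Thanks to @Robert Crovella I solved my problem. It turned out I just needed to use
nvidia/cuda:10.0-devel
as the base image instead of nvidia/cuda:10.0-base. The -base images ship only the minimal CUDA runtime, while the -devel images also include the headers (such as cuda_runtime_api.h) and the compiler tools needed to build CUDA extensions like the ones in mmdet.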
So my Dockerfile is now:
# Dockerfile
FROM nvidia/cuda:10.0-devel
RUN nvidia-smi
RUN set -xe \
&& apt-get update \
&& apt-get install python3-pip -y \
&& apt-get install git -y
RUN pip3 install --upgrade pip
WORKDIR /SingleModelTest
COPY requirements /SingleModelTest/requirements
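# ENV persists into later layers and the running container; a variable exported in a RUN step is lost when that step ends
ENV LD_LIBRARY_PATH=/usr/local/cuda-10.0/lib64:${LD_LIBRARY_PATH}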
RUN pip3 install -r requirements/requirements1.txt
RUN pip3 install -r requirements/requirements2.txt
COPY . /SingleModelTest
ENTRYPOINT ["python"]
CMD ["TabNetAPI.py"]
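To double-check that an image really ships the header the compiler was looking for (gcc was given -I/usr/local/cuda/include in the failing step), a small check can be run inside a container started from that image. This is only a sketch; has_cuda_headers is a hypothetical helper name, and the default path is taken from the include flag in the build log:

```shell
# Succeeds if the CUDA toolkit rooted at $1 (default: /usr/local/cuda)
# contains the header that the failing compile step asked for.
has_cuda_headers() {
    cuda_root=${1:-/usr/local/cuda}
    [ -f "$cuda_root/include/cuda_runtime_api.h" ]
}

# Possible usage inside the container:
#   if has_cuda_headers; then echo "headers found"; else echo "headers missing"; fi
```

Running it in containers started from the -base and -devel tags gives a quick way to compare the two before committing to a long pip install.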