
fatal error: cuda_runtime_api.h: No such file or directory when trying to use cuda in docker

I am trying to build a Docker image for a Python script that I would like to deploy. This is the first time I am using Docker, so I'm probably doing something wrong, but I have no clue what.

My System:

OS: Ubuntu 20.04
docker version: 19.03.8

I am using this Dockerfile:

# Dockerfile
FROM nvidia/cuda:11.0-base

COPY . /SingleModelTest

WORKDIR /SingleModelTest

RUN nvidia-smi

# make sure pip and git are installed so the requirements can be installed
RUN set -xe \
    && apt-get update \
    && apt-get install python3-pip -y \
    && apt-get install git -y 
RUN pip3 install --upgrade pip

RUN pip3 install -r requirements/requirements1.txt
RUN pip3 install -r requirements/requirements2.txt    #this is where it fails

ENTRYPOINT ["python"]

CMD ["TabNetAPI.py"]

The output from nvidia-smi is as expected:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.80.02    Driver Version: 450.80.02    CUDA Version: 11.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  GeForce GTX 1050    Off  | 00000000:01:00.0  On |                  N/A |
|  0%   54C    P0    N/A /  90W |   1983MiB /  1995MiB |     18%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+

So CUDA does work, but when I try to install the required packages from the requirements files, this happens:

     command: /usr/bin/python3 -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/SingleModelTest/src/mmdet/setup.py'"'"'; __file__='"'"'/SingleModelTest/src/mmdet/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' develop --no-deps
         cwd: /SingleModelTest/src/mmdet/
    Complete output (24 lines):
    running develop
    running egg_info
    creating mmdet.egg-info
    writing mmdet.egg-info/PKG-INFO
    writing dependency_links to mmdet.egg-info/dependency_links.txt
    writing requirements to mmdet.egg-info/requires.txt
    writing top-level names to mmdet.egg-info/top_level.txt
    writing manifest file 'mmdet.egg-info/SOURCES.txt'
    reading manifest file 'mmdet.egg-info/SOURCES.txt'
    writing manifest file 'mmdet.egg-info/SOURCES.txt'
    running build_ext
    building 'mmdet.ops.utils.compiling_info' extension
    creating build
    creating build/temp.linux-x86_64-3.8
    creating build/temp.linux-x86_64-3.8/mmdet
    creating build/temp.linux-x86_64-3.8/mmdet/ops
    creating build/temp.linux-x86_64-3.8/mmdet/ops/utils
    creating build/temp.linux-x86_64-3.8/mmdet/ops/utils/src
    x86_64-linux-gnu-gcc -pthread -Wno-unused-result -Wsign-compare -DNDEBUG -g -fwrapv -O2 -Wall -g -fstack-protector-strong -Wformat -Werror=format-security -g -fwrapv -O2 -g -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 -fPIC -DWITH_CUDA -I/usr/local/lib/python3.8/dist-packages/torch/include -I/usr/local/lib/python3.8/dist-packages/torch/include/torch/csrc/api/include -I/usr/local/lib/python3.8/dist-packages/torch/include/TH -I/usr/local/lib/python3.8/dist-packages/torch/include/THC -I/usr/local/cuda/include -I/usr/include/python3.8 -c mmdet/ops/utils/src/compiling_info.cpp -o build/temp.linux-x86_64-3.8/mmdet/ops/utils/src/compiling_info.o -DTORCH_API_INCLUDE_EXTENSION_H -DTORCH_EXTENSION_NAME=compiling_info -D_GLIBCXX_USE_CXX11_ABI=0 -std=c++11
    mmdet/ops/utils/src/compiling_info.cpp:3:10: fatal error: cuda_runtime_api.h: No such file or directory
        3 | #include <cuda_runtime_api.h>
          |          ^~~~~~~~~~~~~~~~~~~~
    compilation terminated.
    error: command 'x86_64-linux-gnu-gcc' failed with exit status 1
    ----------------------------------------
ERROR: Command errored out with exit status 1: /usr/bin/python3 -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/SingleModelTest/src/mmdet/setup.py'"'"'; __file__='"'"'/SingleModelTest/src/mmdet/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' develop --no-deps Check the logs for full command output.

The package that fails is mmdetection. I am using two separate requirements files to make sure some packages are installed before others, to prevent a dependency failure.

requirements1.txt:

torch==1.4.0+cu100 
-f https://download.pytorch.org/whl/torch_stable.html
torchvision==0.5.0+cu100 
-f https://download.pytorch.org/whl/torch_stable.html
numpy==1.19.2

requirements2.txt:

addict==2.3.0
albumentations==0.5.0
appdirs==1.4.4
asynctest==0.13.0
attrs==20.2.0
certifi==2020.6.20
chardet==3.0.4
cityscapesScripts==2.1.7
click==7.1.2
codecov==2.1.10
coloredlogs==14.0
coverage==5.3
cycler==0.10.0
Cython==0.29.21
decorator==4.4.2
flake8==3.8.4
Flask==1.1.2
humanfriendly==8.2
idna==2.10
imagecorruptions==1.1.0
imageio==2.9.0
imgaug==0.4.0
iniconfig==1.1.1
isort==5.6.4
itsdangerous==1.1.0
Jinja2==2.11.2
kiwisolver==1.2.0
kwarray==0.5.9
MarkupSafe==1.1.1
matplotlib==3.3.2
mccabe==0.6.1
mmcv==0.4.3
-e git+https://github.com/open-mmlab/mmdetection.git@0f33c08d8d46eba8165715a0995841a975badfd4#egg=mmdet
networkx==2.5
opencv-python==4.4.0.44
opencv-python-headless==4.4.0.44
ordered-set==4.0.2
packaging==20.4
pandas==1.1.3
Pillow==6.2.2
pluggy==0.13.1
py==1.9.0
pycocotools==2.0.2
pycodestyle==2.6.0
pyflakes==2.2.0
pyparsing==2.4.7
pyquaternion==0.9.9
pytesseract==0.3.6
pytest==6.1.1
pytest-cov==2.10.1
pytest-runner==5.2
python-dateutil==2.8.1
pytz==2020.1
PyWavelets==1.1.1
PyYAML==5.3.1
requests==2.24.0
scikit-image==0.17.2
scipy==1.5.3
Shapely==1.7.1
six==1.15.0
terminaltables==3.1.0
tifffile==2020.9.3
toml==0.10.1
tqdm==4.50.2
typing==3.7.4.3
ubelt==0.9.2
urllib3==1.25.11
Werkzeug==1.0.1
xdoctest==0.15.0
yapf==0.30.0

The command I use to (try to) build the image: nvidia-docker build -t firstdockertestsinglemodel:latest .

Things I have tried:

  • setting the CUDA environment variables like CUDA_HOME, LIBRARY_PATH and LD_LIBRARY_PATH, but I am not sure I did it correctly, since I can't check the paths I set (I can't see them in the Ubuntu Files app); see the sketch right after this list for how they could be set persistently
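
For reference, variables exported with a plain export inside a RUN step do not survive into later layers or into the final image. A minimal sketch of setting them persistently in the Dockerfile, assuming the usual /usr/local/cuda layout of the nvidia/cuda images (these exact paths are an assumption, not taken from the build log):

# hypothetical snippet: paths assume CUDA is installed under /usr/local/cuda
ENV CUDA_HOME=/usr/local/cuda
ENV LIBRARY_PATH=/usr/local/cuda/lib64:$LIBRARY_PATH
ENV LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH

Because ENV is baked into the image metadata, the values can then be checked with docker run --rm --entrypoint env <image> rather than through the host's file manager.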

I'll be very grateful for any help that anyone could offer. If I need to supply more information I'll be happy to.

EDIT: this answer only tells you how to verify what's happening in your Docker image. Unfortunately I'm unable to figure out why it is happening.

How to check it?

At each step of the docker build you can see the various layers being generated. You can use a layer's ID to start a temporary container from that intermediate image and check what's happening, e.g.

docker build -t my_bonk_example .
[...]
Removing intermediate container xxxxxxxxxxxxx
 ---> 57778e7c9788
Step 19/31 : RUN mkdir -p /tmp/spark-events
 ---> Running in afd21d853bcb
Removing intermediate container xxxxxxxxxxxxx
 ---> 33b26e1a2286 <-- let's use this ID
[ failure happens ]

docker run -it --rm --name bonk_container_before_failure 33b26e1a2286 bash
# now you're in the container

echo $LD_LIBRARY_PATH
ls /usr/local/cuda
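
Since the failure here is a missing CUDA header, it is also worth checking in that same temporary container whether the headers exist at all. A quick check, assuming the usual /usr/local/cuda layout (the -base images do not ship the CUDA headers, while the -devel images do):

ls /usr/local/cuda/include/ | grep cuda_runtime   # empty or missing on a -base image, shows cuda_runtime_api.h on a -devel image
dpkg -l | grep -i cuda                            # lists which CUDA packages the image actually contains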

Side notes about your Dockerfile:

You can improve build times for future builds by reordering the instructions in your Dockerfile. Docker uses a layer cache that is invalidated as soon as it encounters something that differs from the previous build. Since you will most likely change your code more often than your image's requirements, it makes sense to move the COPY of the code after the apt and pip instructions, e.g.

# Dockerfile
FROM nvidia/cuda:10.2-base

RUN set -xe \
    && apt-get update \
    && apt-get install python3-pip -y \
    && apt-get install git -y 

RUN pip3 install --upgrade pip

WORKDIR /SingleModelTest

COPY requirements /SingleModelTest/requirements

RUN pip3 install -r requirements/requirements1.txt
RUN pip3 install -r requirements/requirements2.txt

COPY . /SingleModelTest

RUN nvidia-smi

ENTRYPOINT ["python"]
CMD ["TabNetAPI.py"]

NOTE: this is just an example.


Concerning why the image doesn't build: I found that PyTorch 1.4 does not support CUDA 11.0 (https://discuss.pytorch.org/t/pytorch-with-cuda-11-compatibility/89254), but switching to an earlier CUDA version on its own did not fix the issue either.
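
A quick way to confirm which CUDA version a torch wheel was built against, and whether it can reach the GPU, is to ask torch itself from inside the container (a sanity-check sketch, assuming python3 and torch are already installed there):

python3 -c "import torch; print(torch.__version__, torch.version.cuda, torch.cuda.is_available())"

For the cu100 wheels in requirements1.txt this should report 10.0 as the CUDA version.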

Thanks to @Robert Crovella I solved my problem. It turned out I just needed to use nvidia/cuda:10.0-devel as the base image instead of nvidia/cuda:10.0-base: the -devel images include the CUDA headers (such as cuda_runtime_api.h) and build tools, which the -base images lack.

so my Dockerfile is now:

# Dockerfile
FROM nvidia/cuda:10.0-devel

RUN nvidia-smi

RUN set -xe \
    && apt-get update \
    && apt-get install python3-pip -y \
    && apt-get install git -y 
RUN pip3 install --upgrade pip

WORKDIR /SingleModelTest

COPY requirements /SingleModelTest/requirements

# ENV rather than RUN export, since exported variables do not persist into later layers
ENV LD_LIBRARY_PATH=/usr/local/cuda-10.0/lib64

RUN pip3 install -r requirements/requirements1.txt
RUN pip3 install -r requirements/requirements2.txt


COPY . /SingleModelTest

ENTRYPOINT ["python"]

CMD ["TabNetAPI.py"]
