
PyCharm debugging using docker with GPUs

The Goal:

To debug a Python application in PyCharm, with the interpreter set to a custom docker image that uses Tensorflow and so requires a GPU. The problem is that, as far as I can tell, PyCharm's command-building doesn't offer a way to expose the available GPUs to the container.

Terminal - it works:

Enter a container with the following command, specifying which GPUs to make available ( --gpus ):

docker run -it --rm --gpus=all --entrypoint="/bin/bash" 3b6d609a5189        # image has an entrypoint, so I overwrite it

Inside the container, I can run nvidia-smi to see that a GPU is found, and confirm Tensorflow finds it, using:

from tensorflow.python.client import device_lib
device_lib.list_local_devices()
# physical_device_desc: "device: 0, name: Quadro P2000, pci bus id: 0000:01:00.0, compute capability: 6.1"]

If I don't use the --gpus flag, no GPUs are discovered, as expected. Note: with docker version 19.03 and above, Nvidia runtimes are supported natively, so there is no need for nvidia-docker; the docker-run argument --runtime=nvidia is also deprecated. Relevant thread .

PyCharm - it doesn't work

Here is the configuration for the run:

(screenshot: run configuration)

(I realise some of those paths might look incorrect, but that isn't an issue for now)

I set the interpreter to point to the same docker image and run the Python script, setting a custom LD_LIBRARY_PATH as an argument to the run that matches where libcuda.so is located in the docker image (I found it interactively inside a running container), but still no device is found:

(screenshot: error message)

The error message shows that the CUDA library was able to be loaded (i.e. it was found on that LD_LIBRARY_PATH ), but the device was still not found. This is why I believe the docker run argument --gpus=all must be set somewhere. I can't find a way to do that in PyCharm.
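For context (an aside, not from PyCharm's docs): the --gpus flag is only CLI sugar. Since Docker Engine API v1.40 it is translated into a DeviceRequests entry in the container's HostConfig, which is presumably what PyCharm would need to emit on my behalf. A minimal sketch of roughly what that payload looks like (field names from the Engine API; treat the exact shape as an approximation):

```python
# What `docker run --gpus=all` roughly adds to the container's HostConfig
# (Docker Engine API >= 1.40). Count = -1 requests all available GPUs.
device_request = {
    "Driver": "",               # empty string lets the daemon pick the driver
    "Count": -1,                # -1 means "all GPUs"
    "Capabilities": [["gpu"]],  # at least the "gpu" capability is required
}
host_config = {"DeviceRequests": [device_request]}
print(host_config)
```

This is also why the docker-compose `deploy.resources.reservations.devices` route works: it populates the same DeviceRequests structure through a supported configuration channel.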

Other things I have tried:

  1. In PyCharm, using a Docker execution template config (instead of a Python template), where it is possible to specify run arguments. I hoped to pass --gpus=all , but that seems not to be supported by the parser of those options:

(screenshot: parse error)

  2. I tried to set the default runtime to nvidia in the docker daemon by including the following config in /etc/docker/daemon.json :
 { "runtimes": { "nvidia": { "runtimeArgs": ["gpus=all"] } } }

I am not sure of the correct format for this, however. I have tried a few variants of the above, but nothing got the GPUs recognised. The example above could at least be parsed, allowing me to restart the docker daemon without errors.

  3. I noticed that in the official Tensorflow docker images, they install a package (via apt install ) called nvinfer-runtime-trt-repo-ubuntu1804-5.0.2-ga-cuda10.0 , which sounds like a great tool, albeit seemingly just for TensorRT. I added it to my Dockerfile as a shot in the dark, but unfortunately it did not fix the issue.

  4. Adding NVIDIA_VISIBLE_DEVICES=all etc. to the environment variables of the PyCharm configuration, with no luck.

I am using Python 3.6, PyCharm Professional 2019.3 and Docker 19.03.

Docker GPU support is now available in PyCharm 2020.2 without a global default-runtime . Just set --gpus all under the 'Docker container settings' section in the configuration window.

If the no NVIDIA GPU device is present: /dev/nvidia0 does not exist error still occurs, make sure to uncheck Run with Python Console , because that still does not work properly.

It turns out that attempt 2 in the "Other things I tried" section of my post was the right direction, and the following allowed PyCharm's remote interpreter (the docker image) to locate the GPU, just as the Terminal was able to.

I added the following into /etc/docker/daemon.json :

{
    "default-runtime": "nvidia",
    "runtimes": {
        "nvidia": {
            "path": "nvidia-container-runtime",
            "runtimeArgs": []
        }
    }
}

It is also necessary to restart the docker service after saving the file:

sudo service docker restart

Note: this kills all running docker containers on the system.
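One extra precaution worth taking (my own habit, not part of the original answer): a malformed /etc/docker/daemon.json stops the docker daemon from starting at all, so it can pay to sanity-check the file before restarting the service. A small sketch, assuming the standard config path:

```python
import json

def check_daemon_config(path="/etc/docker/daemon.json"):
    """Validate daemon.json before `sudo service docker restart`.

    Raises on malformed JSON; asserts the nvidia runtime settings
    this answer relies on are actually present.
    """
    with open(path) as f:
        config = json.load(f)  # raises json.JSONDecodeError on bad syntax
    assert config.get("default-runtime") == "nvidia", "default-runtime not set to nvidia"
    assert "nvidia" in config.get("runtimes", {}), "nvidia runtime entry missing"
    return config
```

Running `check_daemon_config()` after editing the file, and only restarting docker if it returns cleanly, avoids the unpleasant surprise of a daemon that will not come back up.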

Check out Michał De's answer, it works. However, the interactive console is still broken. With some docker inspect -ing I figured out that using the option Run with Python Console overwrites the docker config, ignoring the provided option --gpus all . I couldn't stand such a loss in quality of life and forced PyCharm to play nice using docker-compose .

Behold, the WORKAROUND.


1. How to test the GPU in Tensorflow

import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))

should return something like

[PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]

2. Make sure you have a simple docker container that works

docker pull tensorflow/tensorflow:latest-gpu-jupyter
docker run --gpus all -it tensorflow/tensorflow:latest-gpu-jupyter python3 -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"

The last print should be as described in step 1. Otherwise see the nvidia guide or the tensorflow guide .


3. Create a compose file and test it

version: '3'
# ^ fixes another pycharm bug
services:
  test:
    image: tensorflow/tensorflow:latest-gpu-jupyter  
    # ^ or your own
    command: python3 -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"  
    # ^ irrelevant, will be overridden by pycharm, but useful for testing
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]

docker-compose --file your_compose_file up

Again, you should see the same output as described in step 1. Given that step 2 was successful, this should go without surprises.


4. Set up this compose file as the interpreter in pycharm

  • Configuration files: your_compose_file
  • Service: test (it just works, but you can have more fun)

(screenshot: interpreter settings)


5. Enjoy your interactive console while running a GPU enabled docker.
