Failed to initialize NVML: Unknown Error in Docker after a few hours

Question

I am having an interesting and weird issue.

When I start a Docker container with GPU access, it works fine and I can see all the GPUs inside the container. However, a few hours or a few days later, I can no longer use the GPUs in Docker.

When I run nvidia-smi inside the container, I see this message:

"Failed to initialize NVML: Unknown Error"

However, on the host machine, nvidia-smi shows all the GPUs. Also, when I restart the container, it works fine again and shows all the GPUs.

My inference container needs to stay up all the time and run inference in response to server requests. Does anyone have the same issue or a solution to this problem?

Answer 1

I had the same issue. I just ran screen watch -n 1 nvidia-smi in the container, and now it works continuously.

Answer 2

I had the same error. I tried Docker's health check as a temporary workaround. When nvidia-smi fails, the container is marked unhealthy and restarted by the restart policy.

Docker-compose version:

healthcheck:
  test: nvidia-smi || exit 1
  start_period: 60s
  interval: 20s
  timeout: 10s
  retries: 2

Dockerfile version:

HEALTHCHECK \
    --start-period=60s \
    --interval=20s \
    --timeout=10s \
    --retries=2 \
    CMD nvidia-smi || exit 1
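
If the same check needs to live in a script (for example, to log the failure or add more probes later), it could be sketched in Python as below. This is only an illustration, not part of the original answer: the function names gpu_healthy and main are made up for this example, and it assumes, as the health check above does, that nvidia-smi exits with a non-zero status when NVML fails to initialize.

```python
import subprocess
import sys


def gpu_healthy(cmd=("nvidia-smi",), timeout=10.0):
    """Return True if cmd exits with status 0 within timeout seconds.

    cmd is a parameter only so the check can be exercised without a GPU;
    in the container it would stay at its default of nvidia-smi.
    """
    try:
        result = subprocess.run(
            cmd,
            stdout=subprocess.DEVNULL,  # only the exit status matters here
            stderr=subprocess.DEVNULL,
            timeout=timeout,
        )
    except (OSError, subprocess.TimeoutExpired):
        # Missing binary or a hung command: treat both as unhealthy.
        return False
    return result.returncode == 0


def main():
    """Map the check to an exit code: 0 = healthy, 1 = unhealthy."""
    return 0 if gpu_healthy() else 1
```

To use it as the health check command, the script would end with sys.exit(main()) and replace nvidia-smi || exit 1 in either version above. The 10-second default mirrors the timeout used in the health check settings.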
