TensorFlow in nvidia-docker: failed call to cuInit: CUDA_ERROR_UNKNOWN

I have been trying to get an application that relies on TensorFlow to run as a Docker container with nvidia-docker. I built my application on top of the tensorflow/tensorflow:latest-gpu-py3 image. I run the container with the following command:

sudo nvidia-docker run -d -p 9090:9090 -v /src/weights:/weights myname/myrepo:mylabel

When I look at the logs through Portainer, I see the following:

2017-05-16 03:41:47.715682: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.1 instructions, but these are available on your machine and could speed up CPU computations.
2017-05-16 03:41:47.715896: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.2 instructions, but these are available on your machine and could speed up CPU computations.
2017-05-16 03:41:47.715948: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX instructions, but these are available on your machine and could speed up CPU computations.
2017-05-16 03:41:47.715978: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX2 instructions, but these are available on your machine and could speed up CPU computations.
2017-05-16 03:41:47.716002: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use FMA instructions, but these are available on your machine and could speed up CPU computations.
2017-05-16 03:41:47.718076: E tensorflow/stream_executor/cuda/cuda_driver.cc:405] failed call to cuInit: CUDA_ERROR_UNKNOWN
2017-05-16 03:41:47.718177: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:158] retrieving CUDA diagnostic information for host: 1e22bdaf82f1
2017-05-16 03:41:47.718216: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:165] hostname: 1e22bdaf82f1
2017-05-16 03:41:47.718298: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:189] libcuda reported version is: 367.57.0
2017-05-16 03:41:47.718398: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:369] driver version file contents: """NVRM version: NVIDIA UNIX x86_64 Kernel Module  367.57  Mon Oct  3 20:37:01 PDT 2016
GCC version:  gcc version 4.8.4 (Ubuntu 4.8.4-2ubuntu1~14.04.3) 
"""
2017-05-16 03:41:47.718455: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:193] kernel reported version is: 367.57.0
2017-05-16 03:41:47.718484: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:300] kernel version seems to match DSO: 367.57.0

The container seems to start properly, and my application appears to be running. When I send it requests for predictions, the predictions come back correctly, but at the slow speed I would expect from inference on the CPU, so I think it's pretty clear that the GPU is not being used for some reason. I also ran nvidia-smi inside that same container to make sure it sees my GPU, and these are the results:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 367.57                 Driver Version: 367.57                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GRID K1             Off  | 0000:00:07.0     Off |                  N/A |
| N/A   28C    P8     7W /  31W |     25MiB /  4036MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+

I'm certainly no expert in this, but it does appear that the GPU is visible from inside the container. Any ideas on how to get this working with TensorFlow?
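The diagnostic lines in the log above compare the libcuda version with the version of the loaded kernel module. As a rough way to repeat the kernel-module half of that check by hand, here is a minimal sketch that extracts the version from the driver version file. It assumes the file lives at /proc/driver/nvidia/version and has the same layout as the "driver version file contents" shown in the log; the function names are made up for this example.

```python
import re
from pathlib import Path
from typing import Optional

# Assumption: the NVRM line looks like the one in the log above, e.g.
# "NVRM version: NVIDIA UNIX x86_64 Kernel Module  367.57  Mon Oct  3 ..."
_NVRM_PATTERN = re.compile(r"Kernel Module\s+(\d+(?:\.\d+)+)")


def parse_kernel_module_version(text: str) -> Optional[str]:
    """Return the kernel module version found in the text, or None."""
    match = _NVRM_PATTERN.search(text)
    return match.group(1) if match else None


def read_kernel_module_version(
    path: str = "/proc/driver/nvidia/version",
) -> Optional[str]:
    """Read the driver version file; return None if it cannot be read."""
    try:
        return parse_kernel_module_version(Path(path).read_text())
    except OSError:
        return None


if __name__ == "__main__":
    print("loaded NVIDIA kernel module:", read_kernel_module_version())
```

Running this both on the host and inside the container, and comparing the result with the driver version that nvidia-smi prints, shows whether both sides agree. A matching version does not by itself guarantee that cuInit will succeed, as the log in the question shows.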

I run TensorFlow on my Ubuntu 16.04 desktop.

My code ran fine on the GPU a few days ago. But today I cannot find the GPU device with the code below:

import tensorflow as tf
from tensorflow.python.client import device_lib as _device_lib

with tf.Session() as sess:
    # List every device TensorFlow can see (CPU and, if available, GPU).
    local_device_protos = _device_lib.list_local_devices()
    print(local_device_protos)
    for x in local_device_protos:
        print(x.name)

I then noticed the following error when I run tf.Session():

cuda_driver.cc:406] failed call to cuInit: CUDA_ERROR_UNKNOWN

I checked my NVIDIA driver in the system details, and used nvcc -V and nvidia-smi to check the driver, CUDA, and cuDNN. Everything seemed fine.

Then I went to Additional Drivers to check the driver details. There I found several versions of the NVIDIA driver listed, with the latest version selected, even though there was only one when I first installed the driver.

So I selected an older version and applied the change.

Then I ran tf.Session() again, and the issue was still there. I thought I should reboot my computer, and after the reboot the issue was gone.

sess = tf.Session()
2018-07-01 12:02:41.336648: I tensorflow/core/platform/cpu_feature_guard.cc:140] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2018-07-01 12:02:41.464166: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:898] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2018-07-01 12:02:41.464482: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1356] Found device 0 with properties:
name: GeForce GTX 1070 major: 6 minor: 1 memoryClockRate(GHz): 1.8225
pciBusID: 0000:01:00.0
totalMemory: 7.93GiB freeMemory: 7.27GiB
2018-07-01 12:02:41.464494: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1435] Adding visible gpu devices: 0
2018-07-01 12:02:42.308689: I tensorflow/core/common_runtime/gpu/gpu_device.cc:923] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-07-01 12:02:42.308721: I tensorflow/core/common_runtime/gpu/gpu_device.cc:929] 0
2018-07-01 12:02:42.308729: I tensorflow/core/common_runtime/gpu/gpu_device.cc:942] 0: N
2018-07-01 12:02:42.309686: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1053] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 7022 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1070, pci bus id: 0000:01:00.0, compute capability:

The problem may be related to the permissions of the JIT cache files created for the GPU. On Linux, the cache files are created in ~/.nv/ComputeCache by default. Setting another directory for the JIT cache solves the problem. Just run

export CUDA_CACHE_PATH=/tmp/nvidia

before running anything on the GPU.
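If the program is started from Python rather than from a shell, the same variable can be set inside the process, as long as that happens before anything initializes CUDA (for example, before TensorFlow is imported). The following is a minimal sketch under that assumption; configure_cuda_cache is a hypothetical helper name, and the directory is only an example.

```python
import os
import tempfile


def configure_cuda_cache(cache_dir: str) -> str:
    """Create cache_dir if needed, check it is writable, and point
    CUDA_CACHE_PATH at it. Returns the directory that was configured."""
    os.makedirs(cache_dir, exist_ok=True)
    if not os.access(cache_dir, os.W_OK):
        raise PermissionError(f"{cache_dir} is not writable by this user")
    os.environ["CUDA_CACHE_PATH"] = cache_dir
    return cache_dir


if __name__ == "__main__":
    # Example location only; any directory the current user can write to works.
    configure_cuda_cache(os.path.join(tempfile.gettempdir(), "nvidia"))
    # Import TensorFlow (or any other CUDA user) only after this point, e.g.:
    # import tensorflow as tf
```

The writability check is there because the suspected cause is a permissions problem on the cache directory; failing early with a clear message is easier to debug than CUDA_ERROR_UNKNOWN.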

I tried installing nvidia-modprobe, but still got the same error. Then a simple system reboot worked for me.

In my case this command fails:

docker run --gpus all --runtime=nvidia -it --rm tensorflow/tensorflow:latest-gpu \
   python -c "import tensorflow as tf; print(tf.reduce_sum(tf.random.normal([1000, 1000])))"

Adding --privileged solves the problem:

docker run --gpus all --runtime=nvidia --privileged -it --rm tensorflow/tensorflow:latest-gpu \
   python -c "import tensorflow as tf; print(tf.reduce_sum(tf.random.normal([1000, 1000])))"
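Since --privileged gives the container access to all host devices, one thing that may be worth checking when the non-privileged run fails is whether the NVIDIA device nodes can actually be opened from inside the container. The sketch below is only a diagnostic aid under that assumption, not a confirmed explanation of the failure; the function names are hypothetical, and the /dev/nvidia* pattern is the usual location of the device nodes.

```python
import glob
import os
from typing import Dict


def can_open_read_write(path: str) -> bool:
    """Return True if the path can be opened for reading and writing."""
    try:
        fd = os.open(path, os.O_RDWR)
    except OSError:
        return False
    os.close(fd)
    return True


def check_device_nodes(pattern: str = "/dev/nvidia*") -> Dict[str, bool]:
    """Map each matching non-directory path to whether it can be opened."""
    return {
        path: can_open_read_write(path)
        for path in sorted(glob.glob(pattern))
        if not os.path.isdir(path)
    }


if __name__ == "__main__":
    nodes = check_device_nodes()
    if not nodes:
        print("no /dev/nvidia* device nodes are visible")
    for path, ok in nodes.items():
        print(f"{path}: {'can be opened' if ok else 'CANNOT be opened'}")
```

Running it once with and once without --privileged and comparing the output shows whether device access differs between the two runs. If it does not, the cause lies elsewhere.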

