
Nvidia device error in tensorflow

To test my tensorflow installation I am using the mnist example provided in the tensorflow repository, but when I execute the convolutional.py script I get this output:

    I tensorflow/stream_executor/dso_loader.cc:125] successfully opened CUDA library libcublas.so.8.0 locally
I tensorflow/stream_executor/dso_loader.cc:125] successfully opened CUDA library libcudnn.so.5 locally
I tensorflow/stream_executor/dso_loader.cc:125] successfully opened CUDA library libcufft.so.8.0 locally
I tensorflow/stream_executor/dso_loader.cc:125] successfully opened CUDA library libcuda.so.1 locally
I tensorflow/stream_executor/dso_loader.cc:125] successfully opened CUDA library libcurand.so.8.0 locally
Extracting data/train-images-idx3-ubyte.gz
Extracting data/train-labels-idx1-ubyte.gz
Extracting data/t10k-images-idx3-ubyte.gz
Extracting data/t10k-labels-idx1-ubyte.gz
W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.2 instructions, but these are available on your machine and could speed up CPU computations.
W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX instructions, but these are available on your machine and could speed up CPU computations.
W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX2 instructions, but these are available on your machine and could speed up CPU computations.
W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use FMA instructions, but these are available on your machine and could speed up CPU computations.
I tensorflow/core/common_runtime/gpu/gpu_device.cc:885] Found device 0 with properties: 
name: GeForce GTX 980 Ti
major: 5 minor: 2 memoryClockRate (GHz) 1.2405
pciBusID 0000:03:00.0
Total memory: 5.93GiB
Free memory: 5.83GiB
W tensorflow/stream_executor/cuda/cuda_driver.cc:590] creating context when one is currently active; existing: 0x29020c0
E tensorflow/core/common_runtime/direct_session.cc:137] Internal: failed initializing StreamExecutor for CUDA device ordinal 1: Internal: failed call to cuDevicePrimaryCtxRetain: CUDA_ERROR_INVALID_DEVICE
Traceback (most recent call last):
  File "convolutional.py", line 339, in <module>
    tf.app.run(main=main, argv=[sys.argv[0]] + unparsed)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 44, in run
    _sys.exit(main(_sys.argv[:1] + flags_passthrough))
  File "convolutional.py", line 284, in main
    with tf.Session() as sess:
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1187, in __init__
    super(Session, self).__init__(target, graph, config=config)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 552, in __init__
    self._session = tf_session.TF_NewDeprecatedSession(opts, status)
  File "/usr/lib/python2.7/contextlib.py", line 24, in __exit__
    self.gen.next()
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/errors_impl.py", line 469, in raise_exception_on_not_ok_status
    pywrap_tensorflow.TF_GetCode(status))
tensorflow.python.framework.errors_impl.InternalError: Failed to create session.

My first idea was that maybe there was a problem with the CUDA installation, so I tested it using one of the samples provided by NVIDIA. In this case I used this example:

NVIDIA_CUDA-8.0_Samples/6_Advanced/c++11_cuda

And the output is this:

GPU Device 0: "GeForce GTX 980 Ti" with compute capability 5.2

Read 3223503 byte corpus from ./warandpeace.txt
counted 107310 instances of 'x', 'y', 'z', or 'w' in "./warandpeace.txt"

My conclusion, then, is that CUDA is installed correctly, but I have no idea what is happening here. Any help would be appreciated.

For more information, this is my GPU configuration:

Tue Jan 31 19:42:10 2017       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 367.57                 Driver Version: 367.57                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 560 Ti  Off  | 0000:01:00.0     N/A |                  N/A |
| 25%   45C    P0    N/A /  N/A |    463MiB /   958MiB |     N/A      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce GTX 980 Ti  Off  | 0000:03:00.0     Off |                  N/A |
|  0%   31C    P8    13W / 280W |      1MiB /  6077MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|    0                  Not Supported                                         |
+-----------------------------------------------------------------------------+

EDIT:

Is it normal that the two NVIDIA cards have the same physical id?

sudo lshw -C "display"
  *-display               
       description: VGA compatible controller
       product: GM200 [GeForce GTX 980 Ti]
       vendor: NVIDIA Corporation
       physical id: 0
       bus info: pci@0000:03:00.0
       version: a1
       width: 64 bits
       clock: 33MHz
       capabilities: pm msi pciexpress vga_controller bus_master cap_list rom
       configuration: driver=nvidia latency=0
       resources: irq:50 memory:f9000000-f9ffffff memory:b0000000-bfffffff memory:c0000000-c1ffffff ioport:d000(size=128) memory:fa000000-fa07ffff
  *-display
       description: VGA compatible controller
       product: GF114 [GeForce GTX 560 Ti]
       vendor: NVIDIA Corporation
       physical id: 0
       bus info: pci@0000:01:00.0
       version: a1
       width: 64 bits
       clock: 33MHz
       capabilities: pm msi pciexpress vga_controller bus_master cap_list rom
       configuration: driver=nvidia latency=0
       resources: irq:45 memory:f6000000-f7ffffff memory:c8000000-cfffffff memory:d0000000-d3ffffff ioport:e000(size=128) memory:f8000000-f807ffff

The important points in the output you have shown are these:

I tensorflow/core/common_runtime/gpu/gpu_device.cc:885] Found device 0 with properties: 
name: GeForce GTX 980 Ti
major: 5 minor: 2 memoryClockRate (GHz) 1.2405
pciBusID 0000:03:00.0
Total memory: 5.93GiB
Free memory: 5.83GiB

i.e. the compute device you want is enumerated as device 0, and

E tensorflow/core/common_runtime/direct_session.cc:137] Internal: failed initializing StreamExecutor for CUDA device ordinal 1: Internal: failed call to cuDevicePrimaryCtxRetain: CUDA_ERROR_INVALID_DEVICE

i.e. the compute device generating the error is enumerated as device 1. Device 1 is your display GPU, which can't be used for computation in TensorFlow. If you either mark that device as compute-prohibited with nvidia-smi, or use the CUDA_VISIBLE_DEVICES environment variable to make only your compute device visible to CUDA, the error should disappear.
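As a minimal sketch of the second approach: set CUDA_VISIBLE_DEVICES before TensorFlow (or any CUDA library) is initialized, so the driver only ever enumerates the compute card. The ordinal "0" below assumes the enumeration shown in your log, where the GTX 980 Ti is device 0.

```python
import os

# Expose only CUDA ordinal 0 (the GTX 980 Ti in this machine's enumeration)
# to this process and its children. This must run BEFORE TensorFlow is
# imported, because CUDA reads the variable when it first initializes.
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

# import tensorflow as tf  # import only after the variable is set
```

You can achieve the same thing from the shell with `CUDA_VISIBLE_DEVICES=0 python convolutional.py`, which avoids touching the script at all.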

I encountered a similar error when I attempted to run the classify_image.py script that is part of the image recognition tutorial. Since I already had a running Python session (elpy) in which I had executed some TensorFlow code, the GPU memory was allocated there and was therefore not available to the script I was trying to run from the shell.

Quitting the existing Python session resolved the error.
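If quitting the other session is not an option, a commonly used alternative (a hedged sketch using the TensorFlow 1.x session-config API, which is what this traceback's version uses) is to stop the first session from reserving all GPU memory up front, so a second process can still initialize:

```python
import tensorflow as tf

# Ask TensorFlow to grow its GPU memory allocation on demand instead of
# grabbing (nearly) all free memory when the session is created.
config = tf.ConfigProto()
config.gpu_options.allow_growth = True

with tf.Session(config=config) as sess:
    pass  # run the graph as usual
```

This only helps when the error is caused by memory contention between processes; it does not change which devices CUDA enumerates.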
