
CUDA_ERROR_INVALID_DEVICE with keras=2.0.5 and tensorflow-gpu=1.2.1

I am working with the following setup:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 384.130                Driver Version: 384.130                   |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K80           Off  | 00000000:05:00.0 Off |                    0 |
| N/A   62C    P0   101W / 149W |  10912MiB / 11439MiB |    100%   E. Process |
+-------------------------------+----------------------+----------------------+
|   1  Tesla K80           Off  | 00000000:06:00.0 Off |                    0 |
| N/A   39C    P0    72W / 149W |  10919MiB / 11439MiB |      0%   E. Process |
+-------------------------------+----------------------+----------------------+
|   2  Tesla K80           Off  | 00000000:84:00.0 Off |                    0 |
| N/A   50C    P0    57W / 149W |  10919MiB / 11439MiB |      0%   E. Process |
+-------------------------------+----------------------+----------------------+
|   3  Tesla K80           Off  | 00000000:85:00.0 Off |                    0 |
| N/A   42C    P0    69W / 149W |  10919MiB / 11439MiB |      0%   E. Process |
+-------------------------------+----------------------+----------------------+

With Python 3.6, CUDA 8:

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2016 NVIDIA Corporation
Built on Tue_Jan_10_13:22:03_CST_2017
Cuda compilation tools, release 8.0, V8.0.61

And cuDNN 5.1.10:

#define CUDNN_MAJOR      5
#define CUDNN_MINOR      1
#define CUDNN_PATCHLEVEL 10
--
#define CUDNN_VERSION    (CUDNN_MAJOR * 1000 + CUDNN_MINOR * 100 + CUDNN_PATCHLEVEL)
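For reference, the macro above packs the three components into a single integer; evaluating it for this install:

```python
# Reproduce the CUDNN_VERSION macro from cudnn.h for version 5.1.10.
CUDNN_MAJOR, CUDNN_MINOR, CUDNN_PATCHLEVEL = 5, 1, 10
CUDNN_VERSION = CUDNN_MAJOR * 1000 + CUDNN_MINOR * 100 + CUDNN_PATCHLEVEL
print(CUDNN_VERSION)  # → 5110
```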

I want to run Keras on GPU #1 with the TensorFlow backend. Given my CUDA/cuDNN versions, I understood that I have to install tensorflow-gpu 1.2 and keras 2.0.5 (see here and here for the compatibility tables).

First, I create a virtual environment like this:

conda create -n keras
source activate keras
conda install keras=2.0.5 tensorflow-gpu=1.2
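A quick way to double-check what conda actually resolved (a sanity-check sketch, assuming the default channel's package names):

```shell
# Inspect the resolved versions; tensorflow-gpu can silently pull in a
# different cudnn build than the one installed system-wide.
conda list | grep -Ei 'keras|tensorflow|cudnn'
```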

Then, if I test the whole thing with the following script:

import os
os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
os.environ["CUDA_VISIBLE_DEVICES"] = "1"

import keras

model = keras.models.Sequential()
model.add(keras.layers.Dense(1, input_dim=1))
model.compile(loss="mse", optimizer="adam")

import numpy as np
model.fit(np.arange(12).reshape(-1, 1), np.arange(12))

I get the following error:

Epoch 1/10
2018-12-13 15:20:42.971806: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.1 instructions, but these are available on your machine and could speed up CPU computations.
2018-12-13 15:20:42.971827: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.2 instructions, but these are available on your machine and could speed up CPU computations.
2018-12-13 15:20:42.971833: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX instructions, but these are available on your machine and could speed up CPU computations.
2018-12-13 15:20:42.971838: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX2 instructions, but these are available on your machine and could speed up CPU computations.
2018-12-13 15:20:42.971843: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use FMA instructions, but these are available on your machine and could speed up CPU computations.
2018-12-13 15:20:42.996052: E tensorflow/core/common_runtime/direct_session.cc:138] Internal: failed initializing StreamExecutor for CUDA device ordinal 0: Internal: failed call to cuDevicePrimaryCtxRetain: CUDA_ERROR_INVALID_DEVICE

We can see in the logs that it tries to create a session on device ordinal 0, which is already taken, as shown by the nvidia-smi output. However, I specified device 1 in the script.
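One subtlety worth keeping in mind: once CUDA_VISIBLE_DEVICES is set, the surviving GPUs are renumbered from 0 inside the process, so "device ordinal 0" in the log does not have to mean physical GPU #0. A minimal sketch of the renumbering (the mapping logic here is illustrative, not a CUDA API):

```python
import os

# With CUDA_DEVICE_ORDER=PCI_BUS_ID and CUDA_VISIBLE_DEVICES=1, only the
# second physical GPU is exposed, and it becomes ordinal 0 in-process.
os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
os.environ["CUDA_VISIBLE_DEVICES"] = "1"

visible = os.environ["CUDA_VISIBLE_DEVICES"].split(",")
# Map in-process ordinal -> physical GPU id as nvidia-smi numbers them.
mapping = {ordinal: int(phys) for ordinal, phys in enumerate(visible)}
print(mapping)  # → {0: 1}
```

So the "CUDA device ordinal 0" in the error can in fact refer to physical GPU #1.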

Do you have any idea what can go wrong here?

I am sorry if the question is inappropriate, but I have been struggling with this for a few days now and can't seem to make any further progress.

Since I solved my problem, I am answering my own question.

There were actually two problems:

  1. When installing tensorflow-gpu=1.2, conda also pulled in cudnn 6.0 (while my system had cuDNN 5.1.10). The solution was to pin the cudnn version explicitly:

    conda install keras=2.0.5 tensorflow-gpu=1.2 cudnn=5.1.10

  2. The second problem, which was in fact the "real" one, was that some of my old processes were still running in the background. Although they were not listed in the nvidia-smi panel, they still held the GPUs, making them inaccessible to my tests. Killing those processes with a kill command solved the problem.
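To track down such hidden processes, one option is to list every PID that keeps the NVIDIA device files open (a sketch assuming fuser from psmisc is installed; the PID below is hypothetical):

```shell
# Processes can hold /dev/nvidia* open without appearing in nvidia-smi's
# process table. fuser lists every PID holding those device files.
sudo fuser -v /dev/nvidia*

# Kill a stale process found in the listing (12345 is a hypothetical PID).
kill -9 12345
```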

I hope those insights will help others struggling as I was.
