TensorFlow 2.1 with NVidia GPU returns warnings and errors: missing libraries, NUMA, convolution ops

Question

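I am trying to train a neural network in TensorFlow 2.1.0. I have installed all the software needed to set up my NVidia RTX 2070 GPU. Indeed, when I run tf.test.is_gpu_available() I get True.

However, this is what happens at the start of every run, when I import tensorflow as tf. The following appears in the terminal: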

2020-05-08 10:07:48.506283: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libnvinfer.so.6
2020-05-08 10:07:48.506523: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libnvinfer_plugin.so.6'; dlerror: libnvrtc.so.10.2: cannot open shared object file: No such file or directory
2020-05-08 10:07:48.506534: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:30] Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly.
2020-05-08 10:07:49.047809: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcuda.so.1
2020-05-08 10:07:49.084978: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-05-08 10:07:49.085264: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1555] Found device 0 with properties: 
pciBusID: 0000:01:00.0 name: GeForce RTX 2070 computeCapability: 7.5
coreClock: 1.44GHz coreCount: 36 deviceMemorySize: 7.79GiB deviceMemoryBandwidth: 417.29GiB/s
2020-05-08 10:07:49.085420: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.1
2020-05-08 10:07:49.085476: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10
2020-05-08 10:07:49.086628: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10
2020-05-08 10:07:49.086807: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10
2020-05-08 10:07:49.087975: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10
2020-05-08 10:07:49.088620: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10
2020-05-08 10:07:49.088643: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2020-05-08 10:07:49.088700: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-05-08 10:07:49.088997: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-05-08 10:07:49.089251: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1697] Adding visible gpu devices: 0

Later, when the actual model training starts, I get:

2020-05-08 10:07:49.235606: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2020-05-08 10:07:49.258082: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2599990000 Hz
2020-05-08 10:07:49.258706: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x5c2fe60 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2020-05-08 10:07:49.258733: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Host, Default Version
2020-05-08 10:07:49.330241: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-05-08 10:07:49.330585: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x5c1e240 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
2020-05-08 10:07:49.330600: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): GeForce RTX 2070, Compute Capability 7.5
2020-05-08 10:07:49.330749: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-05-08 10:07:49.331031: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1555] Found device 0 with properties: 
pciBusID: 0000:01:00.0 name: GeForce RTX 2070 computeCapability: 7.5
coreClock: 1.44GHz coreCount: 36 deviceMemorySize: 7.79GiB deviceMemoryBandwidth: 417.29GiB/s
2020-05-08 10:07:49.331057: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.1
2020-05-08 10:07:49.331065: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10
2020-05-08 10:07:49.331072: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10
2020-05-08 10:07:49.331100: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10
2020-05-08 10:07:49.331108: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10
2020-05-08 10:07:49.331116: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10
2020-05-08 10:07:49.331135: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2020-05-08 10:07:49.331185: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-05-08 10:07:49.331517: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-05-08 10:07:49.331778: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1697] Adding visible gpu devices: 0
2020-05-08 10:07:49.331799: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.1
2020-05-08 10:07:49.332395: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1096] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-05-08 10:07:49.332404: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1102]      0 
2020-05-08 10:07:49.332408: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] 0:   N 
2020-05-08 10:07:49.332499: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-05-08 10:07:49.332793: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-05-08 10:07:49.333078: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1241] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 6381 MB memory) -> physical GPU (device: 0, name: GeForce RTX 2070, pci bus id: 0000:01:00.0, compute capability: 7.5)

and

2020-05-08 10:08:04.498028: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10
2020-05-08 10:08:04.798897: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2020-05-08 10:08:05.159827: E tensorflow/stream_executor/cuda/cuda_dnn.cc:329] Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR
2020-05-08 10:08:05.161453: E tensorflow/stream_executor/cuda/cuda_dnn.cc:329] Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR
2020-05-08 10:08:05.161572: W tensorflow/core/common_runtime/base_collective_executor.cc:217] BaseCollectiveExecutor::StartAbort Unknown: Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
     [[{{node model/conv1d/conv1d}}]]
2020-05-08 10:08:05.163161: E tensorflow/stream_executor/cuda/cuda_dnn.cc:329] Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR
2020-05-08 10:08:05.163198: W tensorflow/core/framework/op_kernel.cc:1655] OP_REQUIRES failed at cudnn_rnn_ops.cc:1510 : Unknown: Fail to find the dnn implementation.
2020-05-08 10:08:05.163233: W tensorflow/core/common_runtime/base_collective_executor.cc:217] BaseCollectiveExecutor::StartAbort Unknown: Fail to find the dnn implementation.
     [[{{node CudnnRNN}}]]

and

tensorflow.python.framework.errors_impl.UnknownError:  Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
     [[node model/conv1d/conv1d (defined at home/ivan/Documents/ML/projects/rnn/wtf_imputation/GAN-RNN_Timeseries-imputation/train.py:71) ]] [Op:__inference_train_on_batch_5414]

Failed to get convolution algorithm is a problem I have solved in the past by adding this block at the top of the training script:

import tensorflow as tf
# Solves Convolution CuDNN error
gpus = tf.config.experimental.list_physical_devices('GPU')
if gpus:
    try:
        for gpu in gpus:
            tf.config.experimental.set_memory_growth(gpu, True)
    except RuntimeError as e:
        print(e)

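But this time it does not work, and I really do not understand why.

EDIT:

Although it says I have CUDA 10.2, I actually installed version 10.1, as TensorFlow requires. Indeed, when I check nvcc --version I get: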

[...]
Cuda compilation tools, release 10.1, V10.1.243

So I do have version 10.1. I do not understand where the problem is.

Answer 1

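1) Cannot dlopen some TensorRT libraries.

You do not appear to have the TensorRT libraries fully installed (they are separate from both TensorFlow and CUDA and provide some specific, optional speedups). You can safely ignore this for now; see TF's install page for further information on how to install them.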
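If the TensorRT warning is only noise for you, one option is to raise TensorFlow's C++ log threshold before the import. This is a minimal sketch, not a fix for anything: `quiet_tf_logs` is a hypothetical helper name, and it relies on the `TF_CPP_MIN_LOG_LEVEL` environment variable. Level "2" hides the I and W lines (including the NUMA and TensorRT messages) but still shows E lines such as the cuDNN error.

```python
import os


def quiet_tf_logs(level: str = "2") -> None:
    """Set TensorFlow's C++ log threshold; call before `import tensorflow`.

    "0" shows everything, "1" hides INFO, "2" also hides WARNING,
    "3" also hides ERROR.
    """
    os.environ["TF_CPP_MIN_LOG_LEVEL"] = level


quiet_tf_logs("2")
# import tensorflow as tf  # import only after the variable is set
```

While you are still debugging the cuDNN failure it is probably better to leave the logs on, since the warnings are what point at the missing library.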

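2) Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR

This is usually caused by cuDNN either not being installed or being the wrong version. Since the log says Successfully opened dynamic library libcudnn.so.7, I lean towards the second option. Check that the version you installed matches the one TensorFlow requires (which may be older than the latest release on NVIDIA's website).

~~As a side note, from your logs it looks like you have CUDA 10.2 installed.~~ TensorFlow requires version 10.1, so this could be another source of problems. If that is the case, you can have both 10.1 and 10.2 installed on your system, or uninstall 10.2 and save some space.

EDIT: the 10.2 in the log refers to the TensorRT libraries; the rest of the log lists libraries at version 10.1, so the side note is probably wrong.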
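One way to see which cuDNN the dynamic loader actually picks up is to call cuDNN's own `cudnnGetVersion()` through `ctypes`. This is a rough sketch under a few assumptions: the helper names are made up for illustration, the soname `libcudnn.so.7` is taken from the log above, and the decoding assumes the cuDNN 7 scheme (major*1000 + minor*100 + patch). As far as I know, the tested build configuration for TensorFlow 2.1 pairs cuDNN 7.6 with CUDA 10.1, so that is the value to compare against.

```python
import ctypes


def decode_cudnn_version(raw: int) -> tuple:
    """Split cuDNN 7's integer version (major*1000 + minor*100 + patch)."""
    return raw // 1000, (raw % 1000) // 100, raw % 100


def loaded_cudnn_version(soname: str = "libcudnn.so.7"):
    """Return (major, minor, patch) of the cuDNN the loader finds, or None."""
    try:
        lib = ctypes.CDLL(soname)
    except OSError:
        return None  # library is not on the loader path
    lib.cudnnGetVersion.restype = ctypes.c_size_t
    return decode_cudnn_version(lib.cudnnGetVersion())


print(loaded_cudnn_version())
```

A result of `None` means the loader cannot find the library at all; a tuple whose minor version differs from what your TensorFlow build expects would support the "wrong version" explanation.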
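To read a log like this without mixing up which version belongs to which library, it can help to pull out the dso_loader lines mechanically. The sketch below is only an illustration: `summarize_dso_log` is a hypothetical helper, the regular expressions are written against the message format seen in the question's log, and the sample is three lines copied from that log.

```python
import re

LOADED = re.compile(r"Successfully opened dynamic library (\S+)")
FAILED = re.compile(r"Could not load dynamic library '([^']+)'; dlerror: (\S+):")


def summarize_dso_log(text: str):
    """Return (sorted loaded libraries, {failed library: missing dependency})."""
    loaded = sorted(set(LOADED.findall(text)))
    failed = {lib: dep for lib, dep in FAILED.findall(text)}
    return loaded, failed


sample = """\
2020-05-08 10:07:48.506523: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libnvinfer_plugin.so.6'; dlerror: libnvrtc.so.10.2: cannot open shared object file: No such file or directory
2020-05-08 10:07:49.085420: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.1
2020-05-08 10:07:49.088643: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
"""

loaded, failed = summarize_dso_log(sample)
print(loaded)
print(failed)
```

On these lines the only 10.2 name is `libnvrtc.so.10.2`, reported as the missing dependency of `libnvinfer_plugin.so.6`, while the CUDA runtime that was actually opened is `libcudart.so.10.1`.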
