
TensorFlow 2.1 with NVidia GPU returning Warnings and Errors on: missing libraries, NUMA, Convolution operation

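I am trying to train a neural network in TensorFlow 2.1.0. I have installed all the software needed to configure my NVidia RTX 2070 GPU. In fact, when I run tf.test.is_gpu_available() I get True.

However, this is what happens when I run import tensorflow as tf at the start of each run. The following appears in the terminal: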

2020-05-08 10:07:48.506283: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libnvinfer.so.6
2020-05-08 10:07:48.506523: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libnvinfer_plugin.so.6'; dlerror: libnvrtc.so.10.2: cannot open shared object file: No such file or directory
2020-05-08 10:07:48.506534: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:30] Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly.
2020-05-08 10:07:49.047809: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcuda.so.1
2020-05-08 10:07:49.084978: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-05-08 10:07:49.085264: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1555] Found device 0 with properties: 
pciBusID: 0000:01:00.0 name: GeForce RTX 2070 computeCapability: 7.5
coreClock: 1.44GHz coreCount: 36 deviceMemorySize: 7.79GiB deviceMemoryBandwidth: 417.29GiB/s
2020-05-08 10:07:49.085420: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.1
2020-05-08 10:07:49.085476: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10
2020-05-08 10:07:49.086628: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10
2020-05-08 10:07:49.086807: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10
2020-05-08 10:07:49.087975: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10
2020-05-08 10:07:49.088620: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10
2020-05-08 10:07:49.088643: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2020-05-08 10:07:49.088700: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-05-08 10:07:49.088997: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-05-08 10:07:49.089251: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1697] Adding visible gpu devices: 0

Later, when the actual model training starts, I get:

2020-05-08 10:07:49.235606: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2020-05-08 10:07:49.258082: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2599990000 Hz
2020-05-08 10:07:49.258706: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x5c2fe60 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2020-05-08 10:07:49.258733: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Host, Default Version
2020-05-08 10:07:49.330241: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-05-08 10:07:49.330585: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x5c1e240 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
2020-05-08 10:07:49.330600: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): GeForce RTX 2070, Compute Capability 7.5
2020-05-08 10:07:49.330749: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-05-08 10:07:49.331031: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1555] Found device 0 with properties: 
pciBusID: 0000:01:00.0 name: GeForce RTX 2070 computeCapability: 7.5
coreClock: 1.44GHz coreCount: 36 deviceMemorySize: 7.79GiB deviceMemoryBandwidth: 417.29GiB/s
2020-05-08 10:07:49.331057: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.1
2020-05-08 10:07:49.331065: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10
2020-05-08 10:07:49.331072: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10
2020-05-08 10:07:49.331100: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10
2020-05-08 10:07:49.331108: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10
2020-05-08 10:07:49.331116: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10
2020-05-08 10:07:49.331135: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2020-05-08 10:07:49.331185: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-05-08 10:07:49.331517: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-05-08 10:07:49.331778: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1697] Adding visible gpu devices: 0
2020-05-08 10:07:49.331799: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.1
2020-05-08 10:07:49.332395: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1096] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-05-08 10:07:49.332404: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1102]      0 
2020-05-08 10:07:49.332408: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] 0:   N 
2020-05-08 10:07:49.332499: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-05-08 10:07:49.332793: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-05-08 10:07:49.333078: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1241] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 6381 MB memory) -> physical GPU (device: 0, name: GeForce RTX 2070, pci bus id: 0000:01:00.0, compute capability: 7.5)

2020-05-08 10:08:04.498028: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10
2020-05-08 10:08:04.798897: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2020-05-08 10:08:05.159827: E tensorflow/stream_executor/cuda/cuda_dnn.cc:329] Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR
2020-05-08 10:08:05.161453: E tensorflow/stream_executor/cuda/cuda_dnn.cc:329] Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR
2020-05-08 10:08:05.161572: W tensorflow/core/common_runtime/base_collective_executor.cc:217] BaseCollectiveExecutor::StartAbort Unknown: Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
     [[{{node model/conv1d/conv1d}}]]
2020-05-08 10:08:05.163161: E tensorflow/stream_executor/cuda/cuda_dnn.cc:329] Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR
2020-05-08 10:08:05.163198: W tensorflow/core/framework/op_kernel.cc:1655] OP_REQUIRES failed at cudnn_rnn_ops.cc:1510 : Unknown: Fail to find the dnn implementation.
2020-05-08 10:08:05.163233: W tensorflow/core/common_runtime/base_collective_executor.cc:217] BaseCollectiveExecutor::StartAbort Unknown: Fail to find the dnn implementation.
     [[{{node CudnnRNN}}]]

tensorflow.python.framework.errors_impl.UnknownError:  Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
     [[node model/conv1d/conv1d (defined at home/ivan/Documents/ML/projects/rnn/wtf_imputation/GAN-RNN_Timeseries-imputation/train.py:71) ]] [Op:__inference_train_on_batch_5414]

Failed to get convolution algorithm is an error I have solved in the past by adding this block at the start of my training script:

import tensorflow as tf
# Solves Convolution CuDNN error
gpus = tf.config.experimental.list_physical_devices('GPU')
if gpus:
    try:
        # Memory growth must be set before the GPUs are initialized
        for gpu in gpus:
            tf.config.experimental.set_memory_growth(gpu, True)
    except RuntimeError as e:
        print(e)



Although the log mentions CUDA 10.2 (libnvrtc.so.10.2), I actually installed version 10.1, as TensorFlow requires. In fact, when I check nvcc --version I get:

Cuda compilation tools, release 10.1, V10.1.243

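So I do have version 10.1. I don't understand where the problem is.

1) Cannot dlopen some TensorRT libraries.

You don't have the TensorRT libraries fully installed. They are independent of TensorFlow and CUDA and provide some specific, optional acceleration features. You can safely ignore this for now; see TensorFlow's installation page for further information on how to install them.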
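To see which of the libraries named in the log the dynamic loader can actually open, a small probe along these lines may help. It is only a sketch: the library names are copied from the log above, try_load is a hypothetical helper name, and a successful load says nothing about whether TensorFlow will use the library.

```python
import ctypes

# Library names copied from the log above; adjust them for your own install.
CANDIDATES = ["libnvinfer.so.6", "libnvinfer_plugin.so.6", "libnvrtc.so.10.2"]


def try_load(name):
    """Return (True, "") if the dynamic loader can open `name`, else (False, reason)."""
    try:
        ctypes.CDLL(name)
        return True, ""
    except OSError as err:
        # ctypes raises OSError when dlopen fails, e.g. "cannot open shared object file"
        return False, str(err)


if __name__ == "__main__":
    for lib in CANDIDATES:
        ok, reason = try_load(lib)
        print(f"{lib}: {'ok' if ok else 'MISSING (' + reason + ')'}")
```

A library reported as missing here either is not installed or lives in a directory that the loader does not search (for example one not listed in LD_LIBRARY_PATH).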


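2) Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR

This is usually caused by cuDNN either not being installed or being the wrong version. Since the log says Successfully opened dynamic library libcudnn.so.7, I lean toward the second option. Check that the version you installed matches the one TensorFlow requires (which may be older than the latest version available on NVIDIA's website).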
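One way to check the installed cuDNN version is to read the version macros from the cuDNN header. The sketch below assumes a cuDNN 7 layout, where CUDNN_MAJOR, CUDNN_MINOR and CUDNN_PATCHLEVEL are defined in cudnn.h; the header path in the usage part is a guess and differs between installs, and parse_cudnn_version is a hypothetical helper name.

```python
import re


def parse_cudnn_version(header_text):
    """Extract (major, minor, patch) from the #define lines of a cuDNN header.

    Returns None if any of the three macros is missing.
    """
    parts = []
    for key in ("CUDNN_MAJOR", "CUDNN_MINOR", "CUDNN_PATCHLEVEL"):
        match = re.search(rf"^\s*#define\s+{key}\s+(\d+)", header_text, re.MULTILINE)
        if match is None:
            return None
        parts.append(int(match.group(1)))
    return tuple(parts)


if __name__ == "__main__":
    # Assumed location; it may instead be /usr/include/cudnn.h on your system.
    header_path = "/usr/local/cuda/include/cudnn.h"
    try:
        with open(header_path) as handle:
            print(parse_cudnn_version(handle.read()))
    except OSError as err:
        print(f"Could not read {header_path}: {err}")
```

The resulting tuple can then be compared against the cuDNN version listed for your TensorFlow release in TensorFlow's tested build configurations.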

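As a side note, from your log it looks like you have CUDA 10.2 installed. TensorFlow requires version 10.1, so this may be another source of problems. If that is the case, you can have both 10.1 and 10.2 installed on your system, or uninstall 10.2 and save some space.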
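To list which CUDA toolkits are present side by side, something like the following could be used. It is a sketch that assumes the toolkits were installed into the conventional /usr/local/cuda-X.Y directories; a different install prefix would need a different glob pattern, and toolkit_versions is a hypothetical helper name.

```python
import glob
import re


def toolkit_versions(paths):
    """Turn directory names such as /usr/local/cuda-10.1 into sorted version strings."""
    versions = []
    for path in paths:
        match = re.search(r"cuda-(\d+\.\d+)$", path.rstrip("/"))
        if match:
            versions.append(match.group(1))
    # Sort numerically so that 10.2 comes after 9.0 rather than before it
    return sorted(versions, key=lambda v: tuple(int(p) for p in v.split(".")))


if __name__ == "__main__":
    # Assumed install prefix; unversioned entries like /usr/local/cuda are skipped.
    print(toolkit_versions(glob.glob("/usr/local/cuda-*")))
```

Note that nvcc --version only reports the toolkit whose nvcc comes first on PATH, so it can show 10.1 even when other versions are installed as well.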

Edit: the 10.2 in the log refers to a TensorRT library; the rest of the log lists libraries at version 10.1, so the side note is probably wrong.
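The reasoning in the edit can be checked mechanically by summarizing the dso_loader lines of the log. The sketch below is based only on the two message formats visible in the log above (other TensorFlow versions may word them differently), and summarize_dso_log is a hypothetical helper name.

```python
import re

# Patterns modelled on the dso_loader.cc messages shown in the log above
OPENED = re.compile(r"Successfully opened dynamic library (\S+)")
FAILED = re.compile(r"Could not load dynamic library '([^']+)'; dlerror: (\S+):")


def summarize_dso_log(log_text):
    """Return (sorted list of opened libraries, {failed library: file named in dlerror})."""
    opened, failed = set(), {}
    for line in log_text.splitlines():
        match = OPENED.search(line)
        if match:
            opened.add(match.group(1))
            continue
        match = FAILED.search(line)
        if match:
            failed[match.group(1)] = match.group(2)
    return sorted(opened), failed


if __name__ == "__main__":
    import sys

    # Usage: python train.py 2>&1 | python this_script.py
    opened, failed = summarize_dso_log(sys.stdin.read())
    print("opened:", opened)
    print("failed:", failed)
```

Applied to the log in the question, the only failure is libnvinfer_plugin.so.6, whose dlerror names libnvrtc.so.10.2, while libcudart is opened as libcudart.so.10.1.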

