TensorFlow GPU model.fit crashes

Question

I am trying to train a very basic model using tensorflow on the GPU (Spyder 4.1.5, Python 3.8.5, tensorflow 2.7.0).

It runs fine on the CPU, but crashes if I set the device to the GPU.

Here is the code:

import tensorflow as tf
from tensorflow.keras import datasets, layers, models
import matplotlib.pyplot as plt

(train_images, train_labels), (test_images, test_labels) = datasets.cifar10.load_data()

train_images, test_images = train_images / 255.0, test_images / 255.0

class_names = ['airplane', 'automobile', 'bird', 'cat', 'deer',
               'dog', 'frog', 'horse', 'ship', 'truck']

model = models.Sequential()
model.add(layers.Conv2D(32, (3, 3), activation='relu', input_shape=(32, 32, 3)))
model.add(layers.MaxPooling2D((2, 2)))
model.add(layers.Conv2D(64, (3, 3), activation='relu'))
model.add(layers.MaxPooling2D((2, 2)))
model.add(layers.Conv2D(64, (3, 3), activation='relu'))

model.add(layers.Flatten())
model.add(layers.Dense(64, activation='relu'))
model.add(layers.Dense(10))

model.summary()

model.compile(optimizer='adam',
              loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
              metrics=['accuracy'])

with tf.device('gpu:0'):
    history = model.fit(train_images, train_labels, epochs=10, 
                    validation_data=(test_images, test_labels))

The kernel crashes at model.fit, and the only output I get is:

" 2022 01:11:35.629875: I tensorflow/core/platform/cpu_feature_guard.cc:151] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX AVX2
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2022 01:11:36.017910: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1525] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 3987 MB memory: -> device: 0, name: NVIDIA GeForce GTX 1660 Ti, pci bus id: 0000:01:00.0, compute capability: 7.5 "

I assume tensorflow + CUDA is installed correctly, because I have another model that runs fine with it.

Is there any way to get more information about the crash? Could it be running out of memory?

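Screenshots from the Spyder console during GPU execution (images not included): one showing the console waiting for a few seconds before the crash, and one showing what appears afterwards.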

Answer 1

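I ran the script directly from the Anaconda prompt. It turned out to be a cuDNN installation problem. This is the error that was raised:

"I tensorflow/stream_executor/cuda/cuda_dnn.cc:366] Loaded cuDNN version 8302
Could not load library cudnn_cnn_infer64_8.dll. Error code 126
Please make sure cudnn_cnn_infer64_8.dll is in your library path!"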
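The error above asks for the DLL to be on the library path. As a rough way to check that, the sketch below lists which directories on PATH contain a given file. It assumes the loader looks in the PATH directories, as the message suggests. The helper name find_on_search_path is made up for this example and is not part of TensorFlow or cuDNN.

```python
import os
from pathlib import Path


def find_on_search_path(filename, search_path=None):
    """Return every directory on the search path that contains `filename`.

    `search_path` defaults to the PATH environment variable (an assumption
    about where the library loader looks).
    """
    if search_path is None:
        search_path = os.environ.get("PATH", "")
    hits = []
    for entry in search_path.split(os.pathsep):
        if not entry:
            continue  # skip empty PATH entries
        if (Path(entry) / filename).is_file():
            hits.append(str(Path(entry)))
    return hits


if __name__ == "__main__":
    dll = "cudnn_cnn_infer64_8.dll"  # name taken from the error message
    locations = find_on_search_path(dll)
    if locations:
        print(f"{dll} found in:")
        for location in locations:
            print("  " + location)
    else:
        print(f"{dll} is not in any PATH directory")
```

If the file is reported as missing, the cuDNN files were probably not copied to a directory on PATH. Running the script from a terminal rather than from Spyder, as was done here, is also what made the error message visible in the first place.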

Answer 2

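In most cases a RAM problem arises because the data was not provided in the right type. Try converting your dataset as follows.

For X (the data), from a list to a numpy.array, like this:

import numpy as np
array = np.asarray(array)  # a plain list could give you problems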
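Related to the memory concern: dividing a uint8 NumPy array by 255.0, as the question's code does, produces float64 values, which take twice the memory of float32. The sketch below shows the difference on a tiny made-up array. The pixel values are hypothetical, and nothing here confirms that memory was the cause of this particular crash.

```python
import numpy as np

# Stand-in for image data held as a nested Python list (hypothetical values).
pixels = [[0, 128, 255], [64, 32, 16]]

as_uint8 = np.asarray(pixels, dtype=np.uint8)

# Dividing a uint8 array by a Python float promotes the result to float64.
scaled_default = as_uint8 / 255.0

# Casting first keeps the result in float32, halving the memory footprint.
scaled_float32 = as_uint8.astype(np.float32) / 255.0

print(scaled_default.dtype, scaled_default.nbytes)
print(scaled_float32.dtype, scaled_float32.nbytes)
```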

For Y (the labels), from a list to categorical:

from tensorflow.keras.utils import to_categorical
y_tr = to_categorical(y_tr)

If your labels are strings, you need to convert your string_labels to int_labels before calling to_categorical, so that:

class_names = ['airplane', 'automobile', 'bird', 'cat', 'deer',
               'dog', 'frog', 'horse', 'ship', 'truck']

becomes

class_names = [0,1,2,3,4,5,6,7,8,9]

A 1 is then placed at the position of your class and a 0 at every other class. For example, the "horse" class would be:

[0,0,0,0,0,0,0,1,0,0]

For "automobile":

[0,1,0,0,0,...,0]
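The mapping described above can be sketched in plain Python. The names label_to_index and one_hot are made up for this illustration; to_categorical performs the one-hot step on integer labels for you.

```python
class_names = ['airplane', 'automobile', 'bird', 'cat', 'deer',
               'dog', 'frog', 'horse', 'ship', 'truck']

# Map each string label to its integer index (string_labels -> int_labels).
label_to_index = {name: i for i, name in enumerate(class_names)}


def one_hot(label, num_classes=len(class_names)):
    """Return a list with 1 at the label's index and 0 everywhere else."""
    index = label_to_index[label]
    return [1 if i == index else 0 for i in range(num_classes)]


print(one_hot('horse'))       # [0, 0, 0, 0, 0, 0, 0, 1, 0, 0]
print(one_hot('automobile'))  # [0, 1, 0, 0, 0, 0, 0, 0, 0, 0]
```

Note that one-hot labels go with CategoricalCrossentropy. The model in the question is compiled with SparseCategoricalCrossentropy, which takes integer labels directly, so the CIFAR-10 labels as loaded there should not need this conversion.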

Answer 1: 0 2022-01-29 08:29:09

Answer 2: 0 2022-07-23 12:31:39
