TensorFlow GPU model.fit crashes

Question

I am trying to train a very basic model using tensorflow on the GPU (Spyder 4.1.5, Python 3.8.5, tensorflow 2.7.0).

It runs fine on the CPU, but it crashes if I set the device to the GPU.

Here is the code:

import tensorflow as tf
from tensorflow.keras import datasets, layers, models
import matplotlib.pyplot as plt

(train_images, train_labels), (test_images, test_labels) = datasets.cifar10.load_data()

train_images, test_images = train_images / 255.0, test_images / 255.0

class_names = ['airplane', 'automobile', 'bird', 'cat', 'deer',
               'dog', 'frog', 'horse', 'ship', 'truck']

model = models.Sequential()
model.add(layers.Conv2D(32, (3, 3), activation='relu', input_shape=(32, 32, 3)))
model.add(layers.MaxPooling2D((2, 2)))
model.add(layers.Conv2D(64, (3, 3), activation='relu'))
model.add(layers.MaxPooling2D((2, 2)))
model.add(layers.Conv2D(64, (3, 3), activation='relu'))

model.add(layers.Flatten())
model.add(layers.Dense(64, activation='relu'))
model.add(layers.Dense(10))

model.summary()

model.compile(optimizer='adam',
              loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
              metrics=['accuracy'])

with tf.device('gpu:0'):
    history = model.fit(train_images, train_labels, epochs=10,
                        validation_data=(test_images, test_labels))

The kernel crashes at model.fit, and the only output I get is:

"2022 01:11:35.629875: I tensorflow/core/platform/cpu_feature_guard.cc:151] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX AVX2
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2022 01:11:36.017910: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1525] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 3987 MB memory: -> device: 0, name: NVIDIA GeForce GTX 1660 Ti, pci bus id: 0000:01:00.0, compute capability: 7.5"

I think tensorflow + CUDA should be installed correctly, because I have another model that works fine.

Is there any way to get more information about the crash? Could it be running out of memory?

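Screenshots from the Spyder console during GPU execution:

[Screenshot: before crashing, it waits like this for a few seconds]

[Screenshot: what I get after that]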

Answer 1

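I ran the script directly from the Anaconda prompt. It turned out to be a problem with the cuDNN installation. This is the error that was raised:

"I tensorflow/stream_executor/cuda/cuda_dnn.cc:366] Loaded cuDNN version 8302
Could not load library cudnn_cnn_infer64_8.dll. Error code 126
Please make sure cudnn_cnn_infer64_8.dll is in your library path!"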
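One way to follow up on that message is to check whether any directory on PATH actually contains the DLL. The sketch below is only an illustration: find_library is a hypothetical helper (not part of TensorFlow or cuDNN), and it assumes that the PATH environment variable is the "library path" the message refers to.

```python
import os
from pathlib import Path


def find_library(filename, search_path=None):
    """Return every directory on the search path that contains `filename`.

    Hypothetical helper for illustration; `search_path` defaults to PATH.
    """
    if search_path is None:
        search_path = os.environ.get("PATH", "")
    hits = []
    for entry in search_path.split(os.pathsep):
        # Skip empty entries, which appear when PATH has a stray separator.
        if entry and (Path(entry) / filename).is_file():
            hits.append(entry)
    return hits


if __name__ == "__main__":
    dll = "cudnn_cnn_infer64_8.dll"  # name taken from the error message
    found = find_library(dll)
    print(found if found else f"{dll} was not found on PATH")
```

If the list comes back empty, the cuDNN files were probably not copied into a directory that is on PATH. If a directory is listed and the error persists, error code 126 can also mean that a dependency of the DLL is missing.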

Answer 2

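In most cases, RAM problems come up because you are not supplying the data in the right type. Try converting your dataset as follows.

For X (the data), from a list to a numpy.array, like this:

import numpy as np
array = np.asarray(array)  # a plain list could give you problems

For Y (the labels), from a list to categorical:

from tensorflow.keras.utils import to_categorical
y_tr = to_categorical(y_tr)

If your labels are strings, you need to convert your string_labels to int_labels before calling to_categorical, so that: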

class_names = ['airplane', 'automobile', 'bird', 'cat', 'deer',
               'dog', 'frog', 'horse', 'ship', 'truck']

becomes

class_names = [0,1,2,3,4,5,6,7,8,9]

Then put a 1 at the position of your class and a 0 at all the others; for example, the "horse" class would be:

[0,0,0,0,0,0,0,1,0,0]

For "automobile":

[0,1,0,0,0,...,0]
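The label conversion described above can be sketched in plain Python. This is an illustration only: LABEL_TO_INT and one_hot are hypothetical names, and the function mimics the layout that to_categorical produces rather than calling it.

```python
CLASS_NAMES = ['airplane', 'automobile', 'bird', 'cat', 'deer',
               'dog', 'frog', 'horse', 'ship', 'truck']

# Map each string label to its integer index, e.g. 'horse' -> 7.
LABEL_TO_INT = {name: i for i, name in enumerate(CLASS_NAMES)}


def one_hot(label, num_classes=len(CLASS_NAMES)):
    """Return a list with a 1 at the label's index and 0 everywhere else."""
    # Accept either a class name or an integer label.
    index = LABEL_TO_INT[label] if isinstance(label, str) else int(label)
    vector = [0] * num_classes
    vector[index] = 1
    return vector


print(one_hot('horse'))       # [0, 0, 0, 0, 0, 0, 0, 1, 0, 0]
print(one_hot('automobile'))  # [0, 1, 0, 0, 0, 0, 0, 0, 0, 0]
```

Note that one-hot labels pair with tf.keras.losses.CategoricalCrossentropy. The code in the question uses SparseCategoricalCrossentropy, which takes integer labels directly, and the labels returned by datasets.cifar10.load_data() are already integers.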

Answer 1
0 2022-01-29 08:29:09

Answer 2
0 2022-07-23 12:31:39
