TensorFlow GPU model.fit crashes

Question

I am trying to train a very basic model using TensorFlow on the GPU (Spyder 4.1.5, Python 3.8.5, tensorflow 2.7.0).

It works fine on the CPU, but crashes if I set the device to GPU.

This is the code:

import tensorflow as tf
from tensorflow.keras import datasets, layers, models
import matplotlib.pyplot as plt

(train_images, train_labels), (test_images, test_labels) = datasets.cifar10.load_data()

train_images, test_images = train_images / 255.0, test_images / 255.0

class_names = ['airplane', 'automobile', 'bird', 'cat', 'deer',
               'dog', 'frog', 'horse', 'ship', 'truck']

model = models.Sequential()
model.add(layers.Conv2D(32, (3, 3), activation='relu', input_shape=(32, 32, 3)))
model.add(layers.MaxPooling2D((2, 2)))
model.add(layers.Conv2D(64, (3, 3), activation='relu'))
model.add(layers.MaxPooling2D((2, 2)))
model.add(layers.Conv2D(64, (3, 3), activation='relu'))

model.add(layers.Flatten())
model.add(layers.Dense(64, activation='relu'))
model.add(layers.Dense(10))

model.summary()

model.compile(optimizer='adam',
              loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
              metrics=['accuracy'])

with tf.device('gpu:0'):
    history = model.fit(train_images, train_labels, epochs=10,
                        validation_data=(test_images, test_labels))

The kernel crashes at model.fit and the only output I get is:

"2022 01:11:35.629875: I tensorflow/core/platform/cpu_feature_guard.cc:151] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX AVX2
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2022 01:11:36.017910: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1525] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 3987 MB memory: -> device: 0, name: NVIDIA GeForce GTX 1660 Ti, pci bus id: 0000:01:00.0, compute capability: 7.5"

I think TensorFlow and CUDA are installed correctly, as I have another model that trains on the GPU without problems.

Is there any way I could get more information about the crash? Could it be that it runs out of memory?

Pictures from the Spyder console during the GPU execution:

Before the crash it waits like this for some seconds:

And afterwards I get this:

Answer 1

I ran the script directly from the Anaconda prompt. It turned out to be an installation problem with cuDNN. This was the error raised:

"I tensorflow/stream_executor/cuda/cuda_dnn.cc:366] Loaded cuDNN version 8302
Could not load library cudnn_cnn_infer64_8.dll. Error code 126
Please make sure cudnn_cnn_infer64_8.dll is in your library path!"

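The two steps in this answer (running the script outside Spyder so the error is visible, then checking whether the DLL named in the error can be found) can be sketched as below. This is only a rough diagnostic aid, not part of the original answer: `run_script` and `find_on_path` are hypothetical helper names, and the sketch assumes the DLL is looked up through the directories listed in `PATH`, which is what the "library path" wording of the error suggests.

```python
import os
import subprocess
import sys


def run_script(script_path):
    """Run a script in a fresh interpreter and return (exit_code, stderr_text).

    Messages that are lost when the Spyder kernel dies usually end up on
    stderr, so capturing it makes them visible.
    """
    result = subprocess.run(
        [sys.executable, script_path],
        capture_output=True,
        text=True,
    )
    return result.returncode, result.stderr


def find_on_path(filename, search_path=None):
    """Return every directory on the search path that contains `filename`.

    `search_path` defaults to the PATH environment variable (an assumption
    about where the library is looked up).
    """
    if search_path is None:
        search_path = os.environ.get("PATH", "")
    hits = []
    for directory in search_path.split(os.pathsep):
        if directory and os.path.isfile(os.path.join(directory, filename)):
            hits.append(directory)
    return hits
```

For example, `run_script("train.py")` (where `train.py` stands in for the script from the question) returns the exit code and the captured error text, and `find_on_path("cudnn_cnn_infer64_8.dll")` returns an empty list if no directory on `PATH` contains the file.
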
Answer 2

In most cases, RAM problems occur because the data is not in the right type. Try converting your dataset as follows:

For X (data), convert from a list to a numpy.array:
```
import numpy as np
array = np.asarray(array)  # a plain list could give you problems
```
For Y (labels), convert from a list to categorical:
```
from tensorflow.keras.utils import to_categorical
y_tr = to_categorical(y_tr)
```

In your case, before you apply to_categorical, you need to convert your string labels to integer labels, so change:

class_names = ['airplane', 'automobile', 'bird', 'cat', 'deer',
               'dog', 'frog', 'horse', 'ship', 'truck']

to

class_names = [0,1,2,3,4,5,6,7,8,9]

and then to_categorical puts a 1 at the position of the class and a 0 at all the other positions, so for example the "horse" class will be:

[0,0,0,0,0,0,0,1,0,0]

and for "automobile":

[0,1,0,0,0,...,0]
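
The mapping described above can be illustrated in plain Python. This is a minimal sketch of what the encoding looks like, not the Keras implementation: `name_to_index` and `one_hot` are hypothetical names introduced here, whereas `to_categorical` itself returns a NumPy array rather than a list. Note also that `datasets.cifar10.load_data()` already returns integer labels, and the `SparseCategoricalCrossentropy` loss used in the question expects integer labels; one-hot labels would normally be paired with `CategoricalCrossentropy` instead.

```python
class_names = ['airplane', 'automobile', 'bird', 'cat', 'deer',
               'dog', 'frog', 'horse', 'ship', 'truck']

# Map each class name to its integer index, e.g. 'horse' -> 7.
name_to_index = {name: i for i, name in enumerate(class_names)}


def one_hot(index, num_classes):
    """Return a list with 1 at `index` and 0 everywhere else."""
    if not 0 <= index < num_classes:
        raise ValueError(f"index {index} out of range for {num_classes} classes")
    return [1 if i == index else 0 for i in range(num_classes)]


horse = one_hot(name_to_index['horse'], len(class_names))
automobile = one_hot(name_to_index['automobile'], len(class_names))
print(horse)       # [0, 0, 0, 0, 0, 0, 0, 1, 0, 0]
print(automobile)  # [0, 1, 0, 0, 0, 0, 0, 0, 0, 0]
```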

Answer 1
0 2022-01-29 08:29:09

Answer 2
0 2022-07-23 12:31:39
