ResNet model takes too long to train

Question

I am using this tutorial to learn transfer learning for my model. As we can see, a single epoch in the tutorial took about 1 second on average.

Epoch 1/100
1080/1080 [==============================] - 10s 10ms/step - loss: 3.6862 - acc: 0.2000
Epoch 2/100
1080/1080 [==============================] - 1s 1ms/step - loss: 3.0746 - acc: 0.2574
Epoch 3/100
1080/1080 [==============================] - 1s 1ms/step - loss: 2.6839 - acc: 0.3185
Epoch 4/100
1080/1080 [==============================] - 1s 1ms/step - loss: 2.3929 - acc: 0.3583
Epoch 5/100
1080/1080 [==============================] - 1s 1ms/step - loss: 2.1382 - acc: 0.3870
Epoch 6/100
1080/1080 [==============================] - 1s 1ms/step - loss: 1.7810 - acc: 0.4593

But when I follow almost the same code for my CIFAR model, a single epoch takes about 1 hour to run.

Train on 50000 samples
 3744/50000 [=>............................] - ETA: 43:38 - loss: 3.3223 - acc: 0.1760

My code is:

from tensorflow.keras.applications import ResNet50
from tensorflow.keras.layers import GlobalAveragePooling2D, Dense, Dropout
from tensorflow.keras import Model

(x_train, y_train), (x_test, y_test) = tf.keras.datasets.cifar10.load_data()
x_train = x_train / 255.0
x_test = x_test / 255.0

y_train = to_categorical(y_train)
y_test = to_categorical(y_test)

base_model = ResNet50(weights= None, include_top=False, input_shape= (32,32,3))

x = base_model.output
x = GlobalAveragePooling2D()(x)
x = Dropout(0.4)(x)
predictions = Dense(10 , activation= 'softmax')(x)
model = Model(inputs = base_model.input, outputs = predictions)

model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['acc'])

hist = model.fit(x_train, y_train)

Note that I am using the CIFAR-10 dataset for this model. Is there anything wrong with my code or with my data? How can I improve this? One epoch taking 1 hour is way too long. I also have an NVIDIA MX-110 2GB GPU, which TensorFlow is of course using.

Answer 1

It doesn't look like you batch your data. As a consequence, each forward pass of the model sees only one training instance, which is very inefficient.

Try setting the batch size in your model.fit() call:

hist = model.fit(x_train, y_train, batch_size=16, epochs=num_epochs, 
                 validation_data=(x_test, y_test), shuffle=True)

Tune the batch size so that it is the largest that fits in your GPU's memory; try a few different values before settling on one.
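
The "try a few values" step can be sketched as a small search loop. This is only an illustration under stated assumptions: largest_fitting_batch_size, run_trial and fake_trial are hypothetical names, not part of Keras, and the fake trial stands in for a short real training run.

```python
def largest_fitting_batch_size(run_trial, candidates=(16, 32, 64, 128),
                               oom_errors=(MemoryError,)):
    """Return the largest candidate for which run_trial(size) did not run out of memory.

    run_trial  -- hypothetical callable that trains briefly with the given batch size
    oom_errors -- exception types treated as "does not fit in memory"
    """
    best = None
    for size in sorted(candidates):
        try:
            run_trial(size)
        except oom_errors:
            break  # larger sizes are assumed not to fit either
        best = size
    return best


def fake_trial(batch_size):
    # Stand-in for a short training run: pretend anything above 64 runs out of memory.
    if batch_size > 64:
        raise MemoryError(f"batch size {batch_size} does not fit")


print(largest_fitting_batch_size(fake_trial))  # prints 64
```

With Keras, run_trial could wrap a one-epoch model.fit call on a small slice of the training data, and oom_errors could be set to (tf.errors.ResourceExhaustedError,). Both are assumptions about how you would wire this up, not something taken from the question.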

Answer 2

I copied and ran your code, but to get it to run I had to make the changes below:

import tensorflow as tf
from tensorflow.keras.applications import ResNet50
from tensorflow.keras.layers import GlobalAveragePooling2D, Dense, Dropout
from tensorflow.keras import Model

(x_train, y_train), (x_test, y_test) = tf.keras.datasets.cifar10.load_data()
print (len(x_train))
x_train = x_train / 255.0
x_test = x_test / 255.0

y_train = tf.keras.utils.to_categorical(y_train)
y_test = tf.keras.utils.to_categorical(y_test)

base_model = ResNet50(weights= None, include_top=False, input_shape= (32,32,3))

x = base_model.output
x = GlobalAveragePooling2D()(x)
x = Dropout(0.4)(x)
predictions = Dense(10 , activation= 'softmax')(x)
model = Model(inputs = base_model.input, outputs = predictions)

model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['acc'])

hist = model.fit(x_train, y_train, epochs=2)
# the result for 2 epochs is shown below
50000
Epoch 1/2
1563/1563 [==============================] - 58s 37ms/step - loss: 2.8654 - acc: 0.2537
Epoch 2/2
1563/1563 [==============================] - 51s 33ms/step - loss: 2.5331 - acc: 0.2748

Per the model.fit documentation, if you do not specify the batch size it defaults to 32. So 50,000 samples / 32 gives 1563 steps (rounded up). For some reason, in your code the batch size defaulted to 1. I do not know why. So set batch_size=50 and you will need 1000 steps. To speed things up further, I would set weights="imagenet" and freeze the layers in the base model with:

for layer in base_model.layers:
    layer.trainable = False
#if you set batch_size=50, weights="imagenet" with the base model frozen you get
50000
Downloading data from https://storage.googleapis.com/tensorflow/keras-applications/resnet/resnet50_weights_tf_dim_ordering_tf_kernels_notop.h5
94773248/94765736 [==============================] - 5s 0us/step
Epoch 1/2
1000/1000 [==============================] - 16s 16ms/step - loss: 2.5101 - acc: 0.1487
Epoch 2/2
1000/1000 [==============================] - 10s 10ms/step - loss: 2.1159 - acc: 0.2249
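
The step counts quoted in this answer follow from dividing the number of training samples by the batch size and rounding up, since the last partial batch still counts as a step. A minimal sketch of that arithmetic (steps_per_epoch is a helper name introduced here for illustration, not a Keras function):

```python
import math


def steps_per_epoch(num_samples, batch_size):
    """Number of batches needed to see every sample once, counting a final partial batch."""
    return math.ceil(num_samples / batch_size)


# CIFAR-10 has 50,000 training images.
for batch_size in (1, 32, 50):
    print(batch_size, steps_per_epoch(50_000, batch_size))
# 1 50000
# 32 1563
# 50 1000
```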

Answer 1: posted 2020-09-25 17:49:50
Answer 2 (accepted): posted 2020-09-26 06:47:29
