Model training gets stuck after the first epoch completes: the second epoch never starts and no error is thrown, it just stays idle

Question

Screenshot showing the model training stuck at epoch 1 without throwing an error

I am using Google Colab Pro, and here is my code snippet:

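# Imports assumed here; the original snippet did not show them.
import os

from tensorflow.keras.layers import (Activation, BatchNormalization, Conv2D,
                                     Dense, Dropout, Flatten, MaxPooling2D)
from tensorflow.keras.models import Sequential
from tensorflow.keras.preprocessing.image import ImageDataGenerator

batch_size = 32
img_height = 256
img_width = 256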

train_datagen = ImageDataGenerator(rescale=1./255,
                                   shear_range=0.2,
                                   zoom_range=0.2,
                                   horizontal_flip=True,
                                   validation_split=0.2)  # set validation split

train_generator = train_datagen.flow_from_directory(
    data_dir,
    target_size=(img_height, img_width),
    batch_size=batch_size,
    class_mode='categorical',
    subset='training')  # set as training data

validation_generator = train_datagen.flow_from_directory(
    data_dir,  # same directory as training data
    target_size=(img_height, img_width),
    batch_size=batch_size,
    class_mode='categorical',
    subset='validation')  # set as validation data

Found 12442 images belonging to 14 classes.
Found 3104 images belonging to 14 classes.

num_classes = 14

model = Sequential()
chanDim = -1
# ReLU is applied once, by the Activation layer below, as in the later blocks
model.add(Conv2D(16, 3, padding='same', input_shape=(img_height, img_width, 3)))
model.add(Activation('relu'))
model.add(BatchNormalization(axis=chanDim))
model.add(MaxPooling2D(pool_size=(3, 3)))
model.add(Dropout(0.25))
model.add(Conv2D(64, (3, 3), padding='same'))
model.add(Activation('relu'))
model.add(BatchNormalization(axis=chanDim))
model.add(Conv2D(64, (3, 3), padding='same'))
model.add(Activation('relu'))
model.add(BatchNormalization(axis=chanDim))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Dropout(0.25))
model.add(Conv2D(128, (3, 3), padding='same'))
model.add(Activation('relu'))
model.add(BatchNormalization(axis=chanDim))
model.add(Conv2D(128, (3, 3), padding='same'))
model.add(Activation('relu'))
model.add(BatchNormalization(axis=chanDim))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Dropout(0.25))
model.add(Flatten())
model.add(Dense(1024))
model.add(Activation('relu'))
model.add(BatchNormalization())
model.add(Dropout(0.5))
model.add(Dense(num_classes))
model.add(Activation('softmax'))

%tensorflow_version 2.x
import tensorflow as tf
device_name = tf.test.gpu_device_name()
if device_name != '/device:GPU:0':
    raise SystemError('GPU device not found')
print('Found GPU at: {}'.format(device_name))

with tf.device('/device:GPU:0'):
    model.summary()


Total params: 58,091,918
Trainable params: 58,089,070
Non-trainable params: 2,848
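
The totals in this summary can be reproduced by hand, which also shows where the parameters sit. The sketch below is only arithmetic on the layer list in the question; the helper names (`pooled_side`, `conv_params`, `bn_params`, `total_params`) are made up for illustration, and it assumes the Keras defaults of biased layers, `'valid'` pooling with stride equal to the pool size, and four parameters per channel in each BatchNormalization layer.

```python
def pooled_side(side, pools=(3, 2, 2)):
    """Spatial size left after the three MaxPooling2D layers (floor division)."""
    for p in pools:
        side //= p
    return side


def conv_params(kernel, c_in, c_out):
    """Weights plus biases of one Conv2D layer."""
    return kernel * kernel * c_in * c_out + c_out


def bn_params(channels):
    """gamma, beta, moving mean and moving variance of one BatchNormalization layer."""
    return 4 * channels


def total_params(side, num_classes=14):
    # (input channels, output channels) of the five Conv2D layers in the question
    convs = [(3, 16), (16, 64), (64, 64), (64, 128), (128, 128)]
    total = sum(conv_params(3, c_in, c_out) + bn_params(c_out) for c_in, c_out in convs)
    flat = pooled_side(side) ** 2 * 128        # size of the Flatten output
    total += flat * 1024 + 1024                # Dense(1024)
    total += bn_params(1024)                   # BatchNormalization after Dense(1024)
    total += 1024 * num_classes + num_classes  # Dense(num_classes)
    return total


print(total_params(256))  # 58091918, matching "Total params" above
```

Under these assumptions, almost all of the count comes from the first Dense layer (21 * 21 * 128 inputs times 1024 units), so the total depends strongly on the input image size.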

# The model already ends in a softmax layer, so the loss receives probabilities, not logits
model.compile(optimizer='adam',
              loss=tf.keras.losses.CategoricalCrossentropy(from_logits=False),
              metrics=['accuracy'])

checkpoint_path = "/content/drive/MyDrive/model_checkpoints"
checkpoint_dir = os.path.dirname(checkpoint_path)

# Create a callback that saves the model's weights
cp_callback = tf.keras.callbacks.ModelCheckpoint(filepath=checkpoint_path,
                                                 save_weights_only=True,
                                                 verbose=1)

epochs = 10
history = model.fit(
    train_generator,
    steps_per_epoch=train_generator.samples // batch_size,
    validation_data=validation_generator,
    validation_steps=validation_generator.samples // batch_size,
    epochs=epochs,
    callbacks=[cp_callback])
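
As a rough check on how much work each epoch involves, the step counts passed to `model.fit` can be worked out from the image counts reported by `flow_from_directory`. This is only arithmetic on the numbers in the question; `epoch_steps` is a hypothetical helper name.

```python
def epoch_steps(train_samples, val_samples, batch_size):
    """Batches per epoch for training and validation, using floor division as in the fit call."""
    return train_samples // batch_size, val_samples // batch_size


train_steps, val_steps = epoch_steps(12442, 3104, 32)
print(train_steps, val_steps)                   # 388 97
print(val_steps / (train_steps + val_steps))    # 0.2
```

So after the 388 training batches finish, each epoch still has 97 validation batches to process, one fifth of all the batches in the epoch.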

TensorFlow version 2.4.1, Keras version 2.4.0

I am training on a dataset of about 15k images with a model of roughly 58 million parameters (see the summary above). I also use an image data generator. When I train the model, it completes its first epoch, but the second epoch never starts: it gets stuck without throwing any error and just stays idle.

Answer 1

I found the cause: with the large dataset and about 60 million parameters, the validation pass at the end of the first epoch took a very long time, and with the default verbose setting I did not see any sign of it. So I reduced the image size from 256x256 to 180x180, which cut the parameter count from about 60 million to about 29 million, and trained the model again. This time I waited about 30 minutes for the validation set to finish after the training set had completed (with the default verbose=1 I could not see any information about it).

In the attached image, the first epoch reports 5389 s (about 90 minutes), but that is only the training time; it does not include the validation time, which took about another 30 minutes. So if your model appears stuck after the training batches finish, just wait, because the validation data is still being processed, or use verbose=2.
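
One way to make the validation phase visible, instead of waiting on a silent progress bar, is a small Keras callback that prints when validation starts and how long it took. The following is a minimal sketch, not part of the original answer: `ValidationTimer` is a hypothetical name, and the `try`/`except` fallback exists only so the class can be exercised without TensorFlow installed. It relies on the `on_test_begin` and `on_test_end` hooks of `tf.keras.callbacks.Callback`, which Keras calls around evaluation and validation.

```python
import time

try:
    from tensorflow.keras.callbacks import Callback
except ImportError:  # fallback so the class can still be exercised without TensorFlow
    Callback = object


class ValidationTimer(Callback):
    """Hypothetical helper: reports when a validation pass starts and ends."""

    def __init__(self):
        super().__init__()
        self.durations = []  # seconds spent in each validation pass
        self._start = None

    def on_test_begin(self, logs=None):
        # Called by Keras when evaluation or validation begins
        self._start = time.perf_counter()
        print("validation started...", flush=True)

    def on_test_end(self, logs=None):
        # Called by Keras when evaluation or validation ends
        elapsed = time.perf_counter() - self._start
        self.durations.append(elapsed)
        print(f"validation finished in {elapsed:.1f} s", flush=True)
```

With the code from the question, it could be passed alongside the checkpoint callback, for example `callbacks=[cp_callback, ValidationTimer()]` in the `model.fit` call, optionally together with `verbose=2`.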

Answer 1
0 2021-03-25 14:52:58