Training with GPU very slow
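I have a dataset of 570,000 images, split into training, validation and test sets at 90%, 5% and 5% respectively.
I started training a model using transfer learning with MobileNetV2.
Loading the data: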
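# TF 2.3-era import path; newer releases expose the same function as
# tf.keras.utils.image_dataset_from_directory
from tensorflow.keras.preprocessing import image_dataset_from_directory

train_dataset = image_dataset_from_directory(
    directory=TRAIN_DIR,
    labels="inferred",
    label_mode="categorical",
    class_names=["0", "10", "5"],
    image_size=SIZE,
    seed=SEED,
    subset=None,
    interpolation="bilinear",
    follow_links=False,
)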
The model:
from tensorflow.keras.applications import MobileNetV2
from tensorflow.keras.layers import AveragePooling2D, Dense, Dropout, Flatten
from tensorflow.keras.models import Model

baseModel = MobileNetV2(
    include_top=False,
    input_shape=INPUT_SHAPE,
    weights='imagenet')
headModel = baseModel.output
headModel = AveragePooling2D(pool_size=(7, 7))(headModel)
headModel = Flatten(name="flatten")(headModel)
headModel = Dense(512, activation="relu")(headModel)
headModel = Dropout(0.5)(headModel)
headModel = Dense(3, activation="softmax")(headModel)
# place the head FC model on top of the base model (this will become
# the actual model we will train)
model = Model(inputs=baseModel.input, outputs=headModel)
# loop over all layers in the base model and freeze them so they will
# *not* be updated during the training process
for layer in baseModel.layers:
    layer.trainable = False
Model summary:
Total params: 2,915,395
Trainable params: 657,411
Non-trainable params: 2,257,984
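As a sanity check on the summary, the trainable count can be reproduced by hand. This is a minimal sketch that assumes INPUT_SHAPE is (224, 224, 3), which the post does not state; with that input the MobileNetV2 base outputs a 7 x 7 x 1280 feature map, and the 7 x 7 average pool reduces it to 1280 features. The helper `dense_params` is illustrative only.

```python
# Assumption: INPUT_SHAPE = (224, 224, 3), so the base output is 7 x 7 x 1280
# and AveragePooling2D(pool_size=(7, 7)) + Flatten yields 1280 features.
features = 1280


def dense_params(n_in, n_out):
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out


# Head: Dense(512) followed by Dense(3); Dropout and pooling have no weights.
head_params = dense_params(features, 512) + dense_params(512, 3)

total_params = 2_915_395  # "Total params" from the summary above
frozen_params = total_params - head_params

print(head_params)    # 657411
print(frozen_params)  # 2257984
```

Both numbers match the summary, so only the small classification head is being trained. That suggests the per-step compute is modest and the GPU itself is unlikely to be the limiting factor.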
The Nvidia K80 I am using is being utilised:
jupyter@tensorflow-4-vm:~$ nvidia-smi
Fri Sep 4 16:23:01 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.87.01    Driver Version: 418.87.01    CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K80           Off  | 00000000:00:04.0 Off |                    0 |
| N/A   55C    P0    58W / 149W |  10871MiB / 11441MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0      8129      C   /opt/conda/bin/python                      10858MiB |
+-----------------------------------------------------------------------------+
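from tensorflow.keras.callbacks import ModelCheckpoint, ReduceLROnPlateau
from tensorflow.keras.metrics import (
    AUC, BinaryAccuracy, FalseNegatives, FalsePositives,
    Precision, Recall, TrueNegatives, TruePositives,
)
from tensorflow.keras.optimizers import Adam

METRICS = [
    TruePositives(name='tp'),
    FalsePositives(name='fp'),
    TrueNegatives(name='tn'),
    FalseNegatives(name='fn'),
    BinaryAccuracy(name='accuracy'),
    Precision(name='precision'),
    Recall(name='recall'),
    AUC(name='auc'),
]
model.compile(optimizer=Adam(learning_rate=0.0001),
              loss="categorical_crossentropy",
              metrics=METRICS)
CALLBACKS = [
    # reduce the learning rate when the monitored loss stops improving
    ReduceLROnPlateau(verbose=1),
    # save a checkpoint after every epoch
    ModelCheckpoint(
        '/home/jupyter/checkpoint/model.{epoch:02d}-{val_loss:.2f}.hdf5',
        verbose=1),
]
# Note: train_dataset is already batched (image_dataset_from_directory
# defaults to batch_size=32); the Keras docs advise against also passing
# batch_size to fit() for dataset inputs.
history = model.fit(
    train_dataset,
    epochs=50,
    verbose=1,
    batch_size=32,
    callbacks=CALLBACKS,
    validation_data=validation_dataset,
)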
But training is extremely slow, even for a single epoch! What could be the reason it is this slow?
# Batch size = 32
Epoch 1/50
17/16229 [..............................] - ETA: 196:20:59 - loss: 1.2727 - tp: 169.0000 - fp: 211.0000 - tn: 877.0000 - fn: 375.0000 - accuracy: 0.6409 - precision: 0.4447 - recall: 0.3107 - auc: 0.5755
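To put the progress line in perspective, the ETA can be converted into a time per step. This is a rough sketch that assumes the Keras ETA is simply the remaining steps multiplied by the average step time so far, and that each step processes one batch of 32 images.

```python
# Figures read off the progress line above.
eta_h, eta_m, eta_s = 196, 20, 59
steps_done, steps_total = 17, 16229
batch_size = 32

eta_seconds = eta_h * 3600 + eta_m * 60 + eta_s
steps_left = steps_total - steps_done

# Average time the run is spending on each step and on each image.
sec_per_step = eta_seconds / steps_left
sec_per_image = sec_per_step / batch_size

print(f"{sec_per_step:.1f} s per step")    # 43.6 s per step
print(f"{sec_per_image:.2f} s per image")  # 1.36 s per image
```

More than a second per image is far slower than a frozen MobileNetV2 forward pass would normally take on a GPU, and the nvidia-smi output shows 0% GPU utilisation, so the time is probably being spent somewhere other than on the GPU.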
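I think data loading may be the problem. If you are loading each file over the network, there are a few things to consider. The best approach is to copy the data to local storage and then train. If that is not possible, try loading the data with TFRecords (you can see how to use them here: https://www.tensorflow.org/tutorials/load_data/tfrecord). Also, make sure the storage and the VM are in the same region.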
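One rough way to test whether file reads are the bottleneck is to time raw reads of a sample of images, once from the network location and once from a local copy. The sketch below uses only the standard library; `read_throughput` is a hypothetical helper written for this illustration, not part of TensorFlow, and the sample size of 200 is an arbitrary choice.

```python
import random
import time
from pathlib import Path


def read_throughput(root, sample_size=200, seed=0):
    """Read a random sample of files under `root`; return (files/s, MB/s)."""
    files = [p for p in Path(root).rglob("*") if p.is_file()]
    if not files:
        raise ValueError(f"no files found under {root}")
    random.Random(seed).shuffle(files)
    sample = files[:sample_size]

    start = time.perf_counter()
    total_bytes = sum(len(p.read_bytes()) for p in sample)
    elapsed = max(time.perf_counter() - start, 1e-9)  # guard against a zero reading

    return len(sample) / elapsed, total_bytes / elapsed / 1e6


# Example (TRAIN_DIR is the directory used in the question):
# files_per_s, mb_per_s = read_throughput(TRAIN_DIR)
```

If the measured rate is well below the roughly 32 images per step the model needs, storage is likely the limiting factor. Repeat runs over the same files can look faster than they really are because of the operating system's page cache.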
I solved the problem by copying the dataset directly onto the VM instance with the following command:
gcloud compute scp /Users/yudhiesh/Desktop/frames_split.zip jupyter@tensorflow-5-vm:~
I then unzipped the folder into the home directory of the VM instance.
Training now takes less than an hour per epoch.
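For reference, the unzip step can also be done from Python, with a quick check that the class folders arrived intact. This is a sketch under assumptions: the archive layout is not shown in the post, so the code simply counts files whose parent folder is named after one of the classes from the question ("0", "10", "5"), and `extract_and_count` is a hypothetical helper.

```python
import zipfile
from pathlib import Path


def extract_and_count(zip_path, dest_dir, class_names=("0", "10", "5")):
    """Extract the archive into dest_dir and count files per class folder."""
    dest = Path(dest_dir)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(dest)

    counts = {name: 0 for name in class_names}
    for path in dest.rglob("*"):
        # Skip hidden files such as the "._*" entries macOS adds to archives.
        if path.is_file() and not path.name.startswith("."):
            if path.parent.name in counts:
                counts[path.parent.name] += 1
    return counts


# Example (paths are placeholders):
# print(extract_and_count("frames_split.zip", Path.home()))
```

The counts are summed across every split that contains a folder with that class name, so the total should come out close to the 570,000 images in the full dataset.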