Keras multi-GPU model fails for a custom model

Question

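I have a simple CNN model trained on ImageNet. I use keras.utils.multi_gpu_model for multi-GPU training. It works fine, but I run into a problem when trying to train an SSD model based on the same backbone network. It has a custom loss and several custom layers on top of the backbone: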

model, predictor_sizes, input_encoder = build_model(input_shape=(args.img_height, args.img_width, 3),
                                                    n_classes=num_classes, mode='training')

optimizer = Adam(lr=0.001, beta_1=0.9, beta_2=0.999, epsilon=1e-08, decay=0.0)
loss = SSDMultiBoxLoss(neg_pos_ratio=3, alpha=1.0)

if args.num_gpus > 1:
    model = multi_gpu_model(model, gpus=args.num_gpus)
model.compile(optimizer=optimizer, loss=loss.compute_loss)
model.summary()

With num_gpus == 1, I get the following summary:

__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
==================================================================================================
input_1 (InputLayer)            (None, 512, 512, 3)  0                                            
__________________________________________________________________________________________________
conv1_pad (Lambda)              (None, 516, 516, 3)  0           input_1[0][0]                    
__________________________________________________________________________________________________
conv1 (Conv2D)                  (None, 256, 256, 16) 1216        conv1_pad[0][0]                  
__________________________________________________________________________________________________
conv1_bn (BatchNormalization)   (None, 256, 256, 16) 64          conv1[0][0]                      
__________________________________________________________________________________________________
conv1_relu (Activation)         (None, 256, 256, 16) 0           conv1_bn[0][0]                   
__________________________________________________________________________________________________

....
                                                                 det_ctx6_2_mbox_loc_reshape[0][0]
__________________________________________________________________________________________________
mbox_priorbox (Concatenate)     (None, None, 8)      0           det_ctx1_2_mbox_priorbox_reshape[
                                                                 det_ctx2_2_mbox_priorbox_reshape[
                                                                 det_ctx3_2_mbox_priorbox_reshape[
                                                                 det_ctx4_2_mbox_priorbox_reshape[
                                                                 det_ctx5_2_mbox_priorbox_reshape[
                                                                 det_ctx6_2_mbox_priorbox_reshape[
__________________________________________________________________________________________________
mbox (Concatenate)              (None, None, 33)     0           mbox_conf_softmax[0][0]          
                                                                 mbox_loc[0][0]                   
                                                                 mbox_priorbox[0][0]              
==================================================================================================
Total params: 1,890,510
Trainable params: 1,888,366
Non-trainable params: 2,144

However, in the multi-GPU case, I can see that all the intermediate layers are packed into a single layer named model:

__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
==================================================================================================
input_1 (InputLayer)            (None, 512, 512, 3)  0                                            
__________________________________________________________________________________________________
lambda (Lambda)                 (None, 512, 512, 3)  0           input_1[0][0]                    
__________________________________________________________________________________________________
lambda_1 (Lambda)               (None, 512, 512, 3)  0           input_1[0][0]                    
__________________________________________________________________________________________________
model (Model)                   (None, None, 33)     1890510     lambda[0][0]                     
                                                                 lambda_1[0][0]                   
__________________________________________________________________________________________________
mbox (Concatenate)              (None, None, 33)     0           model[1][0]                      
                                                                 model[2][0]                      
==================================================================================================
Total params: 1,890,510
Trainable params: 1,888,366
Non-trainable params: 2,144

Training runs fine, but I cannot load the previously pretrained weights:

model.load_weights(args.weights, by_name=True)

because of this error:

ValueError: Layer #3 (named "model") expects 150 weight(s), but the saved weights have 68 element(s).

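Of course, the pretrained model only has weights for the backbone, not for the rest of the object detection model.

Can anyone help me understand:

Why are all the intermediate layers packed into the Lambda layer?
Why doesn't this happen with the classification model?
How can I get around this "model packing", or load pretrained weights into such a model?

Note: I am using tf.keras, which is now part of TensorFlow.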

Answer 1

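You can load the weights right after building the model, before converting it to its multi-GPU counterpart. Alternatively, you can keep two objects, one for the single-GPU version and one for the multi-GPU version, load the weights with the first, and train with the second.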
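The ordering described above can be sketched as a small helper. This is only an illustration, not the asker's code: load_then_wrap and its wrap_fn parameter are hypothetical names, and wrap_fn is assumed to be a callable with the same calling convention as multi_gpu_model (model first, then a gpus keyword).

```python
def load_then_wrap(template, weights_path, num_gpus, wrap_fn):
    """Load pretrained weights into the single-GPU model, then wrap it.

    template     -- the unwrapped model (e.g. what build_model returns)
    weights_path -- path to the pretrained backbone weights
    num_gpus     -- number of GPUs requested for training
    wrap_fn      -- hypothetical stand-in for multi_gpu_model
    Returns (template, training_model).
    """
    # The unwrapped model still exposes every layer under its own name,
    # so by_name=True can match the backbone layers in the weights file.
    template.load_weights(weights_path, by_name=True)

    if num_gpus > 1:
        # Wrap only after the weights are already in place.
        training_model = wrap_fn(template, gpus=num_gpus)
    else:
        training_model = template
    return template, training_model
```

With the names from the question, the call would look roughly like template, model = load_then_wrap(model, args.weights, args.num_gpus, multi_gpu_model), followed by model.compile(...) as before. Keeping the template around is what makes the second suggestion (two objects) possible.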

Answer 2

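When setting up your multi-GPU model, assign the model returned by multi_gpu_model to a new variable such as "model_multiGPU". Then, after training, load the weights using the original model that you passed into multi_gpu_model. This should solve the problem.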
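A minimal sketch of this two-variable pattern follows. The helper name train_parallel_save_template is hypothetical. It relies on the multi-GPU wrapper sharing its weights with the model it wraps, which is also why the Keras documentation for multi_gpu_model recommends saving through the template model.

```python
def train_parallel_save_template(template, model_multiGPU, out_path, **fit_kwargs):
    """Train through the multi-GPU wrapper, save through the original model.

    template       -- the model that was passed into multi_gpu_model
    model_multiGPU -- the model that multi_gpu_model returned
    out_path       -- where to write the trained weights
    fit_kwargs     -- forwarded unchanged to model_multiGPU.fit
    """
    # Training goes through the wrapper, which splits each batch across GPUs.
    history = model_multiGPU.fit(**fit_kwargs)

    # Save from the template so the file keeps the original per-layer names
    # instead of a single nested "model" layer.
    template.save_weights(out_path)
    return history
```

A weights file written this way can later be read back with template.load_weights(out_path, by_name=True), regardless of how many GPUs the next run uses.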

Answer 1
0 2019-01-19 17:56:53

Answer 2
0 2019-12-08 08:58:33