
Training on Multiple GPUs causes NaN Validation Errors in Keras

I have a Keras model which trains fine on a single GPU, but when I train it on multiple GPUs all of the validation losses returned during training are NaN.

I'm using fit_generator and make a call to a validation generator. When training on one GPU, the training and validation losses returned are both valid and my model converges, but on two or more GPUs the training losses are fine and valid while the validation losses are all NaN. Has anyone encountered this problem before, and does anyone have advice on how to remedy it? I've tried the code on multiple computers, each with different numbers and varieties of Keras/TensorFlow-compatible CUDA GPUs, but to no avail. I'm able to train successfully on any computer when using only one GPU, though.

model = multi_gpu_model(Model(inputs=inputs, outputs=outputs),gpus=number_of_gpus, cpu_merge=True, cpu_relocation=False)

hist = model.fit_generator(generator=training_generator,
                           callbacks=callbacks,
                           max_queue_size=max_queue_size,
                           steps_per_epoch=steps_per_epoch,
                           workers=number_of_workers,
                           validation_data=validation_generator,
                           validation_steps=validation_steps,
                           shuffle=False)

My expectation was that the model would return valid validation losses, but instead every single validation loss is NaN. As a result I can't accurately benchmark my training on a multi-GPU machine, which is incredibly inconvenient because I'm looking to accelerate my training speed.

As far as I can (heuristically) tell, when doing distributed training/evaluation, the number of elements in the dataset must be evenly divisible by the batch size and the number of GPUs. That is, nelements % (ngpus * batch_size) == 0. If that is not the case, then empty batches will be passed to the loss function, which, depending on the loss function, may inject NaN losses into the aggregator.

(In the comments, the OP mentioned that their batch size is evenly divisible by the number of GPUs, which is not the same as the number of elements being divisible by the number of GPUs times the batch size.)
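As a hedged illustration (my own sketch, not from the original answer): one way to satisfy this is to compute the validation steps from a global batch size of ngpus * batch_size, so that every replica always receives a full batch. The names nelements, ngpus and batch_size below are placeholder assumptions standing in for your own dataset size, GPU count and per-step batch size.

nelements = 10000   # placeholder: total number of validation samples
ngpus = 2           # placeholder: number of GPUs passed to multi_gpu_model
batch_size = 32     # placeholder: batch size produced by the generator

# A "global" batch spans all GPUs, so full batches require
# nelements % (ngpus * batch_size) == 0.
global_batch_size = ngpus * batch_size
validation_steps = nelements // global_batch_size  # count only full global batches

if nelements % global_batch_size != 0:
    leftover = nelements % global_batch_size
    print("Warning: %d samples do not fill a complete global batch" % leftover)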

I've encountered this problem writing a custom Keras model and using the TF2 nightly builds. My workaround (which has solved my problem) is to modify any loss functions so that they explicitly check the size of the batch. E.g., assuming some error function named fn:

import tensorflow as tf

def loss(y_true, y_pred):
    err = fn(y_true, y_pred)
    # If the batch is empty, return 0 instead of the mean of an empty
    # tensor, which would otherwise inject NaN into the aggregated loss.
    loss = tf.cond(
        tf.size(y_pred) == 0,
        lambda: 0.,
        lambda: tf.math.reduce_mean(err)
    )
    return loss
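As a usage note (my own sketch, not part of the original answer): the guarded loss is passed to compile like any other custom Keras loss. Returning 0 for an empty batch may slightly bias the aggregated mean, but it keeps NaNs out of the aggregator.

# Hypothetical usage: compile the multi-GPU model with the guarded loss.
model.compile(optimizer='adam', loss=loss)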

Another workaround would be to truncate the dataset so that its length is a multiple of the global batch size.
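A minimal truncation sketch, assuming in-memory arrays x_val and y_val and the placeholder names ngpus and batch_size from above (the original post uses generators, so the same idea would apply when building the generator's index list):

# Drop trailing samples so the validation set is a whole number of global batches.
global_batch_size = ngpus * batch_size
usable = (len(x_val) // global_batch_size) * global_batch_size
x_val, y_val = x_val[:usable], y_val[:usable]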
