Training with multiple GPUs and TensorFlow 2.0 fails with error: Out of range: End of sequence

Question

I am training with TensorFlow 2.0 on multiple GPUs and get the error below. With a single GPU, the same code runs without any errors. My TensorFlow version is tensorflow-gpu-2.0.0:

tensorflow.python.framework.errors_impl.CancelledError: 4 root error(s) found.
  (0) Cancelled:  Operation was cancelled
     [[{{node cond_6/else/_59/IteratorGetNext}}]]
  (1) Out of range:  End of sequence
     [[{{node cond_4/else/_37/IteratorGetNext}}]]
  (2) Out of range:  End of sequence
     [[{{node cond_7/else/_70/IteratorGetNext}}]]
     [[metrics/accuracy/div_no_nan/ReadVariableOp_6/_154]]
  (3) Out of range:  End of sequence
     [[{{node cond_7/else/_70/IteratorGetNext}}]]
0 successful operations.
1 derived errors ignored. [Op:__inference_distributed_function_83325]
Function call stack:
distributed_function -> distributed_function -> distributed_function -> distributed_function

Here is my code. You can run it with the environment variable CUDA_VISIBLE_DEVICES=0 or CUDA_VISIBLE_DEVICES=0,1; the two settings give different results:

import tensorflow as tf
import tensorflow_datasets as tfds

data_name = 'uc_merced'
dataset = tfds.load(data_name)
train_data, test_data = dataset['train'], dataset['train']

def parse(img_dict):
    img = tf.image.resize_with_pad(img_dict['image'], 256, 256)
    label = img_dict['label']
    return img, label

train_data = train_data.map(parse)
train_data = train_data.batch(96)

test_data = test_data.map(parse)
test_data = test_data.batch(96)

strategy = tf.distribute.MirroredStrategy()
with strategy.scope():
    model = tf.keras.applications.ResNet50(weights=None, classes=21, input_shape=(256, 256, 3))
    model.compile(optimizer='adam',
            loss='sparse_categorical_crossentropy',
            metrics=['accuracy'])


model.fit(train_data, epochs=50, verbose=2, validation_data=test_data)
model.save('model/resnet_{}.h5'.format(data_name))
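The code above batches the data with batch(96) and lets MirroredStrategy split each batch across the visible GPUs. As a rough aid for reasoning about the "End of sequence" message, here is a minimal pure-Python sketch of that batch arithmetic. It assumes the uc_merced train split holds 2,100 examples (21 classes of 100 images) and that each global batch is divided evenly across replicas; the helper name batch_plan is hypothetical and not part of TensorFlow.

```python
def batch_plan(num_examples, global_batch_size, num_replicas):
    """Summarize how one pass over the data divides into batches.

    Hypothetical helper for illustration only; it does not call TensorFlow.
    """
    full_batches, remainder = divmod(num_examples, global_batch_size)
    return {
        # Steps per epoch if the final partial batch is kept (the default for batch()).
        "steps_keeping_partial_batch": full_batches + (1 if remainder else 0),
        # Steps per epoch if the final partial batch is dropped.
        "steps_dropping_partial_batch": full_batches,
        # Size of the last batch in the epoch.
        "last_batch_size": remainder if remainder else global_batch_size,
        # Assumed share of each full global batch that one replica (GPU) receives.
        "per_replica_batch_size": global_batch_size // num_replicas,
    }


# Assumed dataset size: 2,100 examples; batch size 96 comes from the question's code.
for replicas in (1, 2):
    print(replicas, batch_plan(2100, 96, replicas))
```

Under these assumptions the last batch of each epoch is smaller than 96. If that partial batch turns out to matter, things to try include train_data.batch(96, drop_remainder=True), or train_data.repeat() combined with an explicit steps_per_epoch in model.fit. Nothing in this thread confirms that either change resolves the error.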

Answer 1

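Instead of selecting GPUs with CUDA_VISIBLE_DEVICES, you can try passing the devices to the strategy directly. Note that the devices argument belongs to the MirroredStrategy constructor; strategy.scope() takes no arguments:

strategy = tf.distribute.MirroredStrategy(devices=["/gpu:0", "/gpu:1"])
with strategy.scope():
    ...  # build and compile the model here, as in the question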

Answer 1: score -1, posted 2019-11-15 08:44:27
