
'tf.data()' throwing Your input ran out of data; interrupting training

I'm seeing weird issues when trying to use tf.data() to generate data in batches with the Keras API. It keeps throwing errors saying it's running out of training data.

TensorFlow 2.1

import numpy as np
import nibabel
import tensorflow as tf
from tensorflow.keras.layers import Conv3D, MaxPooling3D
from tensorflow.keras.layers import Dense
from tensorflow.keras.layers import Dropout
from tensorflow.keras.layers import Flatten
from tensorflow.keras import Model
import os
import random


"""Configure GPUs to prevent OOM errors"""
gpus = tf.config.experimental.list_physical_devices('GPU')
for gpu in gpus:
    tf.config.experimental.set_memory_growth(gpu, True)

"""Retrieve file names"""
ad_files = os.listdir("/home/asdf/OASIS/3D/ad/")
cn_files = os.listdir("/home/asdf/OASIS/3D/cn/")

sub_id_ad = []
sub_id_cn = []

"""OASIS AD: 178 Subjects, 278 3T MRIs"""
"""OASIS CN: 588 Subjects, 1640 3T MRIs"""
"""Down-sampling CN to 278 MRIs"""
random.Random(129).shuffle(ad_files)
random.Random(129).shuffle(cn_files)

"""Split files for training"""
ad_train = ad_files[0:276]
cn_train = cn_files[0:276]

"""Shuffle Train data and Train labels"""
train = ad_train + cn_train
labels = np.concatenate((np.ones(len(ad_train)), np.zeros(len(cn_train))), axis=None)
random.Random(129).shuffle(train)
random.Random(129).shuffle(labels)
print(len(train))
print(len(labels))

"""Change working directory to OASIS/3D/all/"""
os.chdir("/home/asdf/OASIS/3D/all/")

"""Create tf data pipeline"""


def load_image(file, label):
    nifti = np.asarray(nibabel.load(file.numpy().decode('utf-8')).get_fdata())

    xs, ys, zs = np.where(nifti != 0)
    nifti = nifti[min(xs):max(xs) + 1, min(ys):max(ys) + 1, min(zs):max(zs) + 1]
    nifti = nifti[0:100, 0:100, 0:100]
    nifti = np.reshape(nifti, (100, 100, 100, 1))
    nifti = tf.convert_to_tensor(nifti, np.float64)
    return nifti, label


@tf.autograph.experimental.do_not_convert
def load_image_wrapper(file, labels):
    return tf.py_function(load_image, [file, labels], [tf.float64, tf.float64])


dataset = tf.data.Dataset.from_tensor_slices((train, labels))
dataset = dataset.shuffle(6, 129)
dataset = dataset.repeat(50)
dataset = dataset.map(load_image_wrapper, num_parallel_calls=6)
dataset = dataset.batch(6)
dataset = dataset.prefetch(buffer_size=1)
iterator = iter(dataset)
batch_images, batch_labels = iterator.get_next()

########################################################################################
with tf.device("/cpu:0"):
    with tf.device("/gpu:0"):
        model = tf.keras.Sequential()

        model.add(Conv3D(64,
                         input_shape=(100, 100, 100, 1),
                         data_format='channels_last',
                         kernel_size=(7, 7, 7),
                         strides=(2, 2, 2),
                         padding='valid',
                         activation='relu'))

    with tf.device("/gpu:1"):
        model.add(Conv3D(64,
                         kernel_size=(3, 3, 3),
                         padding='valid',
                         activation='relu'))

    with tf.device("/gpu:2"):
        model.add(Conv3D(128,
                         kernel_size=(3, 3, 3),
                         padding='valid',
                         activation='relu'))

        model.add(MaxPooling3D(pool_size=(2, 2, 2),
                               padding='valid'))

        model.add(Flatten())

        model.add(Dense(256, activation='relu'))
        model.add(Dense(1, activation='sigmoid'))


model.compile(loss=tf.keras.losses.binary_crossentropy,
              optimizer=tf.keras.optimizers.Adagrad(0.01),
              metrics=['accuracy'])


########################################################################################
model.fit(batch_images, batch_labels, steps_per_epoch=92, epochs=50)

After creating the dataset, I'm shuffling and setting the repeat count to num_of_epochs, i.e. 50 in this case. This works, but it crashes after the 3rd epoch, and I can't seem to figure out what I'm doing wrong in this particular instance. Am I supposed to declare the repeat and shuffle statements at the top of the pipeline? (The ordering most examples use is sketched below for reference.)
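
This is what I mean by "at the top of the pipeline": a full-size shuffle buffer, an unbounded repeat(), and steps_per_epoch derived from the data. It's a sketch reusing train, labels, and load_image_wrapper from above, not a confirmed fix:

# Sketch: shuffle over the whole file list, repeat without a count,
# and let steps_per_epoch bound each epoch instead of repeat(50).
dataset = tf.data.Dataset.from_tensor_slices((train, labels))
dataset = dataset.shuffle(buffer_size=len(train), seed=129)
dataset = dataset.repeat()  # no count: fit() stops each epoch at steps_per_epoch
dataset = dataset.map(load_image_wrapper, num_parallel_calls=6)
dataset = dataset.batch(6)
dataset = dataset.prefetch(buffer_size=1)

steps_per_epoch = len(train) // 6  # derived instead of hard-coding 92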

Here is the error:

Epoch 3/50
92/6 [==============...==============] - 3s 36ms/sample - loss: 0.1902 - accuracy: 0.8043
Epoch 4/50
5/6 [========================>.....] - ETA: 0s - loss: 0.2216 - accuracy: 0.80002020-03-06 15:18:17.804126: W tensorflow/core/common_runtime/base_collective_executor.cc:217] BaseCollectiveExecutor::StartAbort Out of range: End of sequence
         [[{{node IteratorGetNext}}]]
         [[BiasAddGrad_3/_54]]
2020-03-06 15:18:17.804137: W tensorflow/core/common_runtime/base_collective_executor.cc:217] BaseCollectiveExecutor::StartAbort Out of range: End of sequence
         [[{{node IteratorGetNext}}]]
         [[sequential/conv3d_3/Conv3D/ReadVariableOp/_21]]
2020-03-06 15:18:17.804140: W tensorflow/core/common_runtime/base_collective_executor.cc:217] BaseCollectiveExecutor::StartAbort Out of range: End of sequence
         [[{{node IteratorGetNext}}]]
         [[Conv3DBackpropFilterV2_3/_68]]
2020-03-06 15:18:17.804263: W tensorflow/core/common_runtime/base_collective_executor.cc:217] BaseCollectiveExecutor::StartAbort Out of range: End of sequence
         [[{{node IteratorGetNext}}]]
         [[sequential/dense/MatMul/ReadVariableOp/_30]]
2020-03-06 15:18:17.804364: W tensorflow/core/common_runtime/base_collective_executor.cc:217] BaseCollectiveExecutor::StartAbort Out of range: End of sequence
         [[{{node IteratorGetNext}}]]
         [[BiasAddGrad_5/_62]]
2020-03-06 15:18:17.804561: W tensorflow/core/common_runtime/base_collective_executor.cc:217] BaseCollectiveExecutor::StartAbort Out of range: End of sequence
         [[{{node IteratorGetNext}}]]
WARNING:tensorflow:Your input ran out of data; interrupting training. Make sure that your dataset or generator can generate at least `steps_per_epoch * epochs` batches (in this case, 4600 batches). You may need to use the repeat() function when building your dataset.
24/6 [==============...==============] - 1s 36ms/sample - loss: 0.1673 - accuracy: 0.8750
Traceback (most recent call last):
  File "python_scripts/gpu_farm/tf_data_generator/3D_tf_data_generator.py", line 181, in <module>
    evaluation_ad = model.evaluate(ad_test, ad_test_labels, verbose=0)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/keras/engine/training.py", line 930, in evaluate
    use_multiprocessing=use_multiprocessing)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/keras/engine/training_v2.py", line 490, in evaluate
    use_multiprocessing=use_multiprocessing, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/keras/engine/training_v2.py", line 426, in _model_iteration
    use_multiprocessing=use_multiprocessing)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/keras/engine/training_v2.py", line 646, in _process_inputs
    x, y, sample_weight=sample_weights)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/keras/engine/training.py", line 2383, in _standardize_user_data
    batch_size=batch_size)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/keras/engine/training.py", line 2489, in _standardize_tensors
    y, self._feed_loss_fns, feed_output_shapes)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/keras/engine/training_utils.py", line 810, in check_loss_and_target_compatibility
    ' while using as loss `' + loss_name + '`. '
ValueError: A target array with shape (5, 2) was passed for an output of shape (None, 1) while using as loss `binary_crossentropy`. This loss expects targets to have the same shape as the output.

Update: So model.fit() should be supplied with model.fit(x=data, y=labels) when using tf.data(), because of a weird problem. This removes the list out of index error, and now I'm back to my original error. However, it looks like this could be a TensorFlow problem: https://github.com/tensorflow/tensorflow/issues/32

So when I increase the batch size from 6 to higher numbers and decrease steps_per_epoch, it goes through more epochs without throwing the StartAbort: Out of range errors.
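
The batch math is consistent with that observation. The warning above asks for at least steps_per_epoch * epochs batches; assuming both training slices are full (276 + 276 = 552 files), a quick check shows the pipeline has exactly zero slack:

# Back-of-the-envelope check (assumes len(train) == 552):
n_samples = 276 + 276               # 552
available = (n_samples * 50) // 6   # repeat(50), batch(6) -> 4600 batches total
required = 92 * 50                  # steps_per_epoch * epochs -> 4600 batches
print(available, required)          # 4600 4600: no spare batches at all

With zero slack, the single batch consumed outside of fit() by the explicit iterator.get_next() call would already leave the pipeline one batch short, which is at least consistent with the end-of-sequence warnings, though I haven't verified that this is the exact cause.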

Update 2: As per @jkjung13's suggestion, model.fit() takes one parameter when using a dataset: model.fit(x=batch). This is the correct implementation.

But you are supposed to supply the dataset itself instead of an iterable object if you're only using the x parameter in model.fit().

So it should be: model.fit(dataset, epochs=50, steps_per_epoch=46, validation_data=(v, v_labels))

And with that I get a new error: GitHub Issue

Now to overcome this, I'm converting the dataset to a numpy iterator: model.fit(dataset.as_numpy_iterator(), epochs=50, steps_per_epoch=46, validation_data=(v, v_labels))

This solves the problem; however, the performance is appalling, similar to the old Keras model.fit_generator without multiprocessing. So this defeats the whole purpose of tf.data.
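
For comparison, the pattern the TF 2.x documentation describes is to pass the repeated dataset straight to fit() and let steps_per_epoch bound each epoch, roughly like the sketch below. As noted above, this exact route hit a separate error in my environment, so treat it as the documented shape rather than something I have working:

# Documented TF 2.x pattern (sketch, not verified in this environment):
dataset = tf.data.Dataset.from_tensor_slices((train, labels))
dataset = dataset.shuffle(len(train), seed=129)
dataset = dataset.map(load_image_wrapper, num_parallel_calls=6)
dataset = dataset.batch(6).repeat().prefetch(1)

model.fit(dataset, epochs=50, steps_per_epoch=len(train) // 6)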

TF 2.1

This is now working with the following parameters:

def load_image(file, label):
    nifti = np.asarray(nibabel.load(file.numpy().decode('utf-8')).get_fdata()).astype(np.float32)

    xs, ys, zs = np.where(nifti != 0)
    nifti = nifti[min(xs):max(xs) + 1, min(ys):max(ys) + 1, min(zs):max(zs) + 1]
    nifti = nifti[0:100, 0:100, 0:100]
    nifti = np.reshape(nifti, (100, 100, 100, 1))
    return nifti, label


@tf.autograph.experimental.do_not_convert
def load_image_wrapper(file, label):
    # Output dtypes must match what load_image actually returns:
    # the volume is cast to float32 above, the label stays float64.
    return tf.py_function(load_image, [file, label], [tf.float32, tf.float64])


dataset = tf.data.Dataset.from_tensor_slices((train, labels))
dataset = dataset.map(load_image_wrapper, num_parallel_calls=32)
dataset = dataset.prefetch(buffer_size=1)
dataset = dataset.apply(tf.data.experimental.prefetch_to_device('/device:GPU:0', 1))

# So, my dataset size is 522, i.e. 522 MRI images.
# I need to load the entire dataset as a single batch.
# I expected this to exceed 60 GiB of RAM, but it doesn't go over 12 GiB.
# I'm not sure how tf.data batch() stores the data, maybe a custom file?
# Also add repeat() so the dataset re-iterates on each epoch.
dataset = dataset.batch(522, drop_remainder=True).repeat()

# Now initialise an iterator
iterator = iter(dataset)

# Create two objects, x & y, from batch
batch_image, batch_label = iterator.get_next()

##################################################################################
with tf.device("/cpu:0"):
    with tf.device("/gpu:0"):
        model = tf.keras.Sequential()

        model.add(Conv3D(64,
                         input_shape=(100, 100, 100, 1),
                         data_format='channels_last',
                         kernel_size=(7, 7, 7),
                         strides=(2, 2, 2),
                         padding='valid',
                         activation='relu'))

    with tf.device("/gpu:1"):
        model.add(Conv3D(64,
                         kernel_size=(3, 3, 3),
                         padding='valid',
                         activation='relu'))

    with tf.device("/gpu:2"):
        model.add(Conv3D(128,
                         kernel_size=(3, 3, 3),
                         padding='valid',
                         activation='relu'))

        model.add(MaxPooling3D(pool_size=(2, 2, 2),
                               padding='valid'))

        model.add(Flatten())

        model.add(Dense(256, activation='relu'))
        model.add(Dropout(0.7))
        model.add(Dense(1, activation='sigmoid'))

model.compile(loss=tf.keras.losses.binary_crossentropy,
              optimizer=tf.keras.optimizers.Adagrad(0.01),
              metrics=['accuracy'])
##################################################################################

# Now supply x=batch_image, y=batch_label to Keras' model.fit()
# and, finally, supply your batch_size here!
model.fit(batch_image, batch_label, epochs=100, batch_size=12)

##################################################################################
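
A quick note on batch_size=12 above: because x and y here are in-memory tensors (one giant batch of 522 samples), Keras does the mini-batching itself. This is a hedged reading of the arithmetic, not something I found spelled out in the docs:

# Keras slices the 522-sample tensors into mini-batches of 12:
# ceil(522 / 12) = 44 gradient steps per epoch.
steps_per_epoch = -(-522 // 12)  # ceiling division -> 44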

With this setup, it takes around 8 minutes for training to start. But once training starts, I'm seeing incredible speeds!

Epoch 30/100
522/522 [==============================] - 14s 26ms/sample - loss: 0.3526 - accuracy: 0.8640
Epoch 31/100
522/522 [==============================] - 15s 28ms/sample - loss: 0.3334 - accuracy: 0.8448
Epoch 32/100
522/522 [==============================] - 16s 31ms/sample - loss: 0.3308 - accuracy: 0.8697
Epoch 33/100
522/522 [==============================] - 14s 26ms/sample - loss: 0.2936 - accuracy: 0.8755
Epoch 34/100
522/522 [==============================] - 14s 26ms/sample - loss: 0.2935 - accuracy: 0.8851
Epoch 35/100
522/522 [==============================] - 14s 28ms/sample - loss: 0.3157 - accuracy: 0.8889
Epoch 36/100
522/522 [==============================] - 16s 31ms/sample - loss: 0.2910 - accuracy: 0.8851
Epoch 37/100
522/522 [==============================] - 14s 26ms/sample - loss: 0.2810 - accuracy: 0.8697
Epoch 38/100
522/522 [==============================] - 14s 26ms/sample - loss: 0.2536 - accuracy: 0.8966
Epoch 39/100
522/522 [==============================] - 16s 31ms/sample - loss: 0.2506 - accuracy: 0.9004
Epoch 40/100
522/522 [==============================] - 15s 28ms/sample - loss: 0.2353 - accuracy: 0.8927
Epoch 41/100
522/522 [==============================] - 14s 26ms/sample - loss: 0.2336 - accuracy: 0.9042
Epoch 42/100
522/522 [==============================] - 14s 26ms/sample - loss: 0.2243 - accuracy: 0.9234
Epoch 43/100
522/522 [==============================] - 15s 29ms/sample - loss: 0.2181 - accuracy: 0.9176

15 seconds per epoch, compared to the old 12 minutes per epoch!

I will do further testing to see if this is actually working, and what impact it has on my test data. If there are any errors, I will come back and update this post.

Why does this work? I have no idea. I couldn't find anything in the Keras documentation.
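
As an aside on the giant batch(522) trick: if the goal is just to keep all decoded volumes in memory after the first pass, dataset.cache() is an alternative worth trying. The sketch below is an assumption on my part and is untested against the setup above; cache() with no argument keeps the decoded tensors in RAM, so mind the memory on larger datasets:

# Hypothetical alternative to batch(522): cache decoded volumes in memory
# after the first epoch, then shuffle/batch normally.
dataset = tf.data.Dataset.from_tensor_slices((train, labels))
dataset = dataset.map(load_image_wrapper, num_parallel_calls=32)
dataset = dataset.cache()                        # keep decoded tensors in RAM
dataset = dataset.shuffle(len(train), seed=129)
dataset = dataset.batch(12).repeat().prefetch(1)

model.fit(dataset, epochs=100, steps_per_epoch=len(train) // 12)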
