
OutOfRangeError: TensorFlow iterator not reinitializing between runs

I am fine-tuning an Inception model with TensorFlow using the setup below, feeding batches through the tf.data.Dataset API. However, every time I attempt to train this model (before successfully retrieving any batches), I get an OutOfRangeError claiming that the iterator is exhausted:

Caught OutOfRangeError. Stopping Training. End of sequence
     [[node IteratorGetNext (defined at <ipython-input-8-c768436e70d8>:13)  = IteratorGetNext[output_shapes=[[?,224,224,3], [?,1]], output_types=[DT_FLOAT, DT_FLOAT], _device="/job:localhost/replica:0/task:0/device:CPU:0"](OneShotIterator)]]

I created a function that feeds in hard-coded batches in place of the result of get_batch, and this runs and converges without any issues, leading me to believe that the graph and session code is working properly. I also tested the get_batch function by iterating over it in a session (see the sketch after get_batch below), and this causes no errors. The behavior I would expect is that restarting training (especially after resetting the notebook, etc.) would produce a fresh iterator over the dataset.

Code to train model:

import tensorflow as tf
from tensorflow.contrib import slim
# Assumes the TF-Slim nets; `inception` may instead come from the
# tensorflow/models slim repo (`from nets import inception`).
from tensorflow.contrib.slim.nets import inception

with tf.Graph().as_default():

    tf.logging.set_verbosity(tf.logging.INFO)
    images, labels = get_batch(filenames=tf_train_record_path + train_file)

    # Create the model, use the default arg scope to configure the batch norm parameters.
    with slim.arg_scope(inception.inception_v1_arg_scope()):
        logits, ax = inception.inception_v1(images, num_classes=1, is_training=True)

    # Specify the loss function:
    tf.losses.mean_squared_error(labels, logits)
    total_loss = tf.losses.get_total_loss()
    tf.summary.scalar('losses/Total_Loss', total_loss)

    # Specify the optimizer and create the train op:
    optimizer = tf.train.AdamOptimizer(learning_rate=0.01)
    train_op = slim.learning.create_train_op(total_loss, optimizer)

    # Run the training:
    final_loss = slim.learning.train(
        train_op,
        logdir=train_dir,
        init_fn=get_init_fn(),
        number_of_steps=1)

Code to get batches using the Dataset API:

def get_batch(filenames):
    dataset = tf.data.TFRecordDataset(filenames=filenames)

    # `parse` (defined elsewhere) decodes one serialized example
    # into an (image, label) pair.
    dataset = dataset.map(parse)
    dataset = dataset.batch(2)

    # A one-shot iterator is created once per graph and cannot be
    # reinitialized after it is exhausted.
    iterator = dataset.make_one_shot_iterator()
    data_X, data_y = iterator.get_next()

    return data_X, data_y
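
For reference, the standalone session test described above can look like the following minimal sketch (it reuses the same user-defined paths from the training code):

with tf.Graph().as_default():
    images, labels = get_batch(filenames=tf_train_record_path + train_file)
    with tf.Session() as sess:
        try:
            # Pull batches until the one-shot iterator is exhausted.
            while True:
                X, y = sess.run([images, labels])
                print(X.shape, y.shape)
        except tf.errors.OutOfRangeError:
            print("Iterator exhausted (end of dataset).")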

This previously asked question resembles the issue I am experiencing; however, I am not using a batch_join call. I am not sure if this is an issue with slim.learning.train, restoring from a checkpoint, or scope. Any help would be appreciated!

Your input pipeline looks OK. The problem might be a damaged TFRecords file. You can try your code with random data, or use your images as numpy arrays with tf.data.Dataset.from_tensor_slices(). Your parse function may also cause problems; try printing your image/label with sess.run.
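
Here is a minimal sketch of that random-data test. The get_batch_random name and example count are illustrative; the shapes match what inception_v1 expects from the question's pipeline:

import numpy as np
import tensorflow as tf

def get_batch_random(num_examples=16):
    # Synthetic data with the shapes from the question:
    # [?, 224, 224, 3] float images and [?, 1] float labels.
    images = np.random.rand(num_examples, 224, 224, 3).astype(np.float32)
    labels = np.random.rand(num_examples, 1).astype(np.float32)
    dataset = tf.data.Dataset.from_tensor_slices((images, labels))
    dataset = dataset.batch(2)
    iterator = dataset.make_one_shot_iterator()
    return iterator.get_next()

If training converges with this substituted for get_batch, the TFRecords file or the parse function is the likely culprit.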

I'd also advise using the Estimator API instead of slim.learning.train. It is much more convenient, and slim will be deprecated soon.
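
As a rough sketch of what that migration could look like (illustrative only; it reuses parse and the paths from the question, and handles only the TRAIN mode):

def input_fn():
    # Estimator accepts a Dataset yielding (features, labels) pairs.
    dataset = tf.data.TFRecordDataset(tf_train_record_path + train_file)
    dataset = dataset.map(parse).batch(2)
    return dataset

def model_fn(features, labels, mode):
    with slim.arg_scope(inception.inception_v1_arg_scope()):
        logits, _ = inception.inception_v1(
            features, num_classes=1,
            is_training=(mode == tf.estimator.ModeKeys.TRAIN))
    loss = tf.losses.mean_squared_error(labels, logits)
    train_op = tf.train.AdamOptimizer(learning_rate=0.01).minimize(
        loss, global_step=tf.train.get_global_step())
    return tf.estimator.EstimatorSpec(mode, loss=loss, train_op=train_op)

estimator = tf.estimator.Estimator(model_fn, model_dir=train_dir)
estimator.train(input_fn, steps=1)

A side benefit relevant to this question: Estimator.train calls input_fn on every invocation, so a fresh dataset and iterator are built for each training run.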
