
OutOfRangeError in scaling up Tensor2Tensor Transformer TPU tutorial

I followed the T2T Transformer "Train a language model" example and it worked for 10 training steps. However, when scaling up to 250,000 steps I get an OutOfRangeError (below). Is this a problem with parsing or something else?

INFO:tensorflow:Init TPU system
INFO:tensorflow:Starting infeed thread controller.
INFO:tensorflow:Starting outfeed thread controller.
INFO:tensorflow:Enqueue next (100) batch(es) of data to infeed.
INFO:tensorflow:Dequeue next (100) batch(es) of data from outfeed.
WARNING:tensorflow:

Error occurred during infeed/outfeed.  This may be due to a compile error in the main session.  Waiting for a short time for the main session to come back.
  End of sequence
         [[Node: input_pipeline_task0/while/IteratorGetNext = IteratorGetNext[_class=["loc:@input_pipeline_task0/while/InfeedQueue/split/4"], output_shapes=[[64,1], [64,256,1,1], [64,256], [64,256], [64,256,1,1]], output_types=[DT_INT32, DT_INT32, DT_INT32, DT_INT32, DT_INT32], _device="/job:tpu_worker/replica:0/task:0/device:CPU:0"](input_pipeline_task0/while/IteratorGetNext/Enter, ^input_pipeline_task0/while/Identity)]]

Caused by op 'input_pipeline_task0/while/IteratorGetNext', defined at:
  File "/usr/local/bin/t2t-trainer", line 32, in <module>
    tf.app.run()
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/platform/app.py", line 125, in run
    _sys.exit(main(argv))
  File "/usr/local/bin/t2t-trainer", line 28, in main
    t2t_trainer.main(argv)
  File "/usr/local/lib/python3.5/dist-packages/tensor2tensor/bin/t2t_trainer.py", line 359, in main
    execute_schedule(exp)
  ...
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/contrib/tpu/python/tpu/tpu_estimator.py", line 729, in enqueue_ops_fn
    features, labels = inputs.features_and_labels()
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/contrib/tpu/python/tpu/tpu_estimator.py", line 2766, in features_and_labels
    return _Inputs._parse_inputs(self._iterator.get_next())
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/data/ops/iterator_ops.py", line 373, in get_next
    name=name)), self._output_types,
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/ops/gen_dataset_ops.py", line 1745, in iterator_get_next
    output_shapes=output_shapes, name=name)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
    op_def=op_def)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/ops.py", line 3414, in create_op
    op_def=op_def)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/ops.py", line 1740, in __init__
    self._traceback = self._graph._extract_stack()  # pylint: disable=protected-access

OutOfRangeError (see above for traceback): End of sequence
         [[Node: input_pipeline_task0/while/IteratorGetNext = IteratorGetNext[_class=["loc:@input_pipeline_task0/while/InfeedQueue/split/4"], output_shapes=[[64,1], [64,256,1,1], [64,256], [64,256], [64,256,1,1]], output_types=[DT_INT32, DT_INT32, DT_INT32, DT_INT32, DT_INT32], _device="/job:tpu_worker/replica:0/task:0/device:CPU:0"](input_pipeline_task0/while/IteratorGetNext/Enter, ^input_pipeline_task0/while/Identity)]]

ERROR:tensorflow:Feed error: Traceback (most recent call last):
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1322, in _do_call
    return fn(*args)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1307, in _run_fn
    options, feed_dict, fetch_list, target_list, run_metadata)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1409, in _call_tf_sessionrun
    run_metadata)
tensorflow.python.framework.errors_impl.OutOfRangeError: End of sequence
         [[Node: input_pipeline_task0/while/IteratorGetNext = IteratorGetNext[_class=["loc:@input_pipeline_task0/while/InfeedQueue/split/4"], output_shapes=[[64,1], [64,256,1,1], [64,256], [64,256], [64,256,1,1]], output_types=[DT_INT32, DT_INT32, DT_INT32, DT_INT32, DT_INT32], _device="/job:tpu_worker/replica:0/task:0/device:CPU:0"](input_pipeline_task0/while/IteratorGetNext/Enter, ^input_pipeline_task0/while/Identity)]]

During handling of the above exception, another exception occurred:
...

  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1335, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.CancelledError: Step was cancelled by an explicit call to `Session::Close()`.

I assume you've followed the instructions in this document. The relevant error in the output is the "OutOfRangeError: End of sequence" line. This error is the signal the input pipeline uses to tell upstream consumers that there is no more data to process.

You need to ensure there is data for the TPU to process. Check the following:

- The TPU has access to the training data (e.g. the GCS bucket).
- There are no typos in the paths in the command.
- Most importantly, your dataset is either large enough or uses dataset.repeat(), so that the training data does not run out before the TPU has completed the configured number of training steps.
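As a rough illustration of why a finite input raises "End of sequence" while a repeated one does not, here is a plain-Python analogy. It does not use TensorFlow: `StopIteration` stands in for `OutOfRangeError`, `itertools.cycle` stands in for `dataset.repeat()`, and `take_steps` is a hypothetical helper written only for this sketch.

```python
import itertools


def take_steps(batches, num_steps):
    """Try to pull num_steps batches; return how many were actually served."""
    it = iter(batches)
    served = 0
    for _ in range(num_steps):
        try:
            next(it)
        except StopIteration:
            # Analogous to OutOfRangeError: the input ran out before
            # the requested number of steps was reached.
            break
        served += 1
    return served


finite = range(5)                     # one pass over 5 batches
repeated = itertools.cycle(range(5))  # analogous to dataset.repeat()

print(take_steps(finite, 8))    # stops early: only 5 batches exist
print(take_steps(repeated, 8))  # all 8 steps are served
```

The finite input stops after 5 of the 8 requested steps, which is the situation the error message describes; the repeated input never runs out.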

One of the authors of the Tensor2Tensor library here.

Short answer: reduce --eval_steps.

Long answer:

Unfortunately, TPUEstimator, the library we use under the hood to run on TPU, does not catch OutOfRangeError when you run out of input data. During training this is not a problem because the input data is infinite (we call repeat on the input tf.data.Dataset). During evaluation, however, you want to make a single pass over the data, which means you need to set --eval_steps low enough that you don't exhaust the input data. Hopefully TPUEstimator will soon catch the error itself so that you don't have to work out how many eval steps to run.
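One way to pick a safe value is to count how many full batches the evaluation set can supply. The sketch below is only an estimate under stated assumptions: the evaluation data is read once without repeat, every batch has a fixed size, and a partial final batch is dropped. The batch size of 64 is read from the leading dimension of output_shapes in the traceback above; the example count of 5000 is a made-up number, and `max_eval_steps` is a hypothetical helper, not part of Tensor2Tensor.

```python
def max_eval_steps(num_eval_examples, batch_size):
    """Upper bound on eval steps for one pass over a non-repeating dataset.

    Assumes fixed-size batches with any partial final batch dropped.
    """
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    return num_eval_examples // batch_size


# Hypothetical evaluation set size; batch size taken from the log above.
print(max_eval_steps(5000, 64))
```

Any --eval_steps value at or below this bound should stay within the available data under those assumptions; if the actual batching differs, the bound needs to be adjusted accordingly.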


 