Failed to load (restore) TensorFlow checkpoint when running run_squad.py to fine-tune the Google BERT model (official TensorFlow pre-trained model)

I am new to deep learning and NLP, and I am trying to get started with the pre-trained Google BERT model. Since I intend to build a QA system with BERT, I decided to start with the SQuAD fine-tuning example.

I followed the instructions in README.md in the official Google BERT GitHub repository.

I ran the following commands:

export BERT_BASE_DIR=/home/bert/Dev/venv/uncased_L-12_H-768_A-12/
export SQUAD_DIR=/home/bert/Dev/venv/squad
python run_squad.py \
  --vocab_file=$BERT_BASE_DIR/vocab.txt \
  --bert_config_file=$BERT_BASE_DIR/bert_config.json \
  --init_checkpoint=$BERT_BASE_DIR/bert_model.ckpt \
  --do_train=True \
  --train_file=$SQUAD_DIR/train-v1.1.json \
  --do_predict=True \
  --predict_file=$SQUAD_DIR/dev-v1.1.json \
  --train_batch_size=12 \
  --learning_rate=3e-5 \
  --num_train_epochs=2.0 \
  --max_seq_length=384 \
  --doc_stride=128 \
  --output_dir=/tmp/squad_base/

After a few minutes (when training started), I got this:

(a lot of output omitted)
INFO:tensorflow:start_position: 53
INFO:tensorflow:end_position: 54
INFO:tensorflow:answer: february 1848
INFO:tensorflow:***** Running training *****
INFO:tensorflow:  Num orig examples = 87599
INFO:tensorflow:  Num split examples = 88641
INFO:tensorflow:  Batch size = 12
INFO:tensorflow:  Num steps = 14599
INFO:tensorflow:Calling model_fn.
INFO:tensorflow:Running train on CPU
INFO:tensorflow:*** Features ***
INFO:tensorflow:  name = end_positions, shape = (12,)
INFO:tensorflow:  name = input_ids, shape = (12, 384)
INFO:tensorflow:  name = input_mask, shape = (12, 384)
INFO:tensorflow:  name = segment_ids, shape = (12, 384)
INFO:tensorflow:  name = start_positions, shape = (12,)
INFO:tensorflow:  name = unique_ids, shape = (12,)
INFO:tensorflow:Error recorded from training_loop: Unsuccessful TensorSliceReader constructor: Failed to find any matching files for /home/bert/Dev/venv/uncased_L-12_H-768_A-12//bert_model.ckpt
INFO:tensorflow:training_loop marked as finished
WARNING:tensorflow:Reraising captured error
Traceback (most recent call last):
  File "run_squad.py", line 1283, in <module>
    tf.app.run()
  File "/home/bert/Dev/venv/lib/python3.5/site-packages/tensorflow/python/platform/app.py", line 125, in run
    _sys.exit(main(argv))
  File "run_squad.py", line 1215, in main
    estimator.train(input_fn=train_input_fn, max_steps=num_train_steps)
  File "/home/bert/Dev/venv/lib/python3.5/site-packages/tensorflow/contrib/tpu/python/tpu/tpu_estimator.py", line 2400, in train
    rendezvous.raise_errors()
  File "/home/bert/Dev/venv/lib/python3.5/site-packages/tensorflow/contrib/tpu/python/tpu/error_handling.py", line 128, in raise_errors
    six.reraise(typ, value, traceback)
  File "/home/bert/Dev/venv/lib/python3.5/site-packages/six.py", line 693, in reraise
    raise value
  File "/home/bert/Dev/venv/lib/python3.5/site-packages/tensorflow/contrib/tpu/python/tpu/tpu_estimator.py", line 2394, in train
    saving_listeners=saving_listeners
  File "/home/bert/Dev/venv/lib/python3.5/site-packages/tensorflow/python/estimator/estimator.py", line 356, in train
    loss = self._train_model(input_fn, hooks, saving_listeners)
  File "/home/bert/Dev/venv/lib/python3.5/site-packages/tensorflow/python/estimator/estimator.py", line 1181, in _train_model
    return self._train_model_default(input_fn, hooks, saving_listeners)
  File "/home/bert/Dev/venv/lib/python3.5/site-packages/tensorflow/python/estimator/estimator.py", line 1211, in _train_model_default
    features, labels, model_fn_lib.ModeKeys.TRAIN, self.config)
  File "/home/bert/Dev/venv/lib/python3.5/site-packages/tensorflow/contrib/tpu/python/tpu/tpu_estimator.py", line 2186, in _call_model_fn
    features, labels, mode, config)
  File "/home/bert/Dev/venv/lib/python3.5/site-packages/tensorflow/python/estimator/estimator.py", line 1169, in _call_model_fn
    model_fn_results = self._model_fn(features=features, **kwargs)
  File "/home/bert/Dev/venv/lib/python3.5/site-packages/tensorflow/contrib/tpu/python/tpu/tpu_estimator.py", line 2470, in _model_fn
    features, labels, is_export_mode=is_export_mode)
  File "/home/bert/Dev/venv/lib/python3.5/site-packages/tensorflow/contrib/tpu/python/tpu/tpu_estimator.py", line 1250, in call_without_tpu
    return self._call_model_fn(features, labels, is_export_mode=is_export_mode)
  File "/home/bert/Dev/venv/lib/python3.5/site-packages/tensorflow/contrib/tpu/python/tpu/tpu_estimator.py", line 1524, in _call_model_fn
    estimator_spec = self._model_fn(features=features, **kwargs)
  File "run_squad.py", line 623, in model_fn
    ) = modeling.get_assignment_map_from_checkpoint(tvars, init_checkpoint)
  File "/home/bert/Dev/venv/bert/modeling.py", line 330, in get_assignment_map_from_checkpoint
    init_vars = tf.train.list_variables(init_checkpoint)
  File "/home/bert/Dev/venv/lib/python3.5/site-packages/tensorflow/python/training/checkpoint_utils.py", line 95, in list_variables
    reader = load_checkpoint(ckpt_dir_or_file)
  File "/home/bert/Dev/venv/lib/python3.5/site-packages/tensorflow/python/training/checkpoint_utils.py", line 64, in load_checkpoint
    return pywrap_tensorflow.NewCheckpointReader(filename)
  File "/home/bert/Dev/venv/lib/python3.5/site-packages/tensorflow/python/pywrap_tensorflow_internal.py", line 314, in NewCheckpointReader
    return CheckpointReader(compat.as_bytes(filepattern), status)
  File "/home/bert/Dev/venv/lib/python3.5/site-packages/tensorflow/python/framework/errors_impl.py", line 526, in __exit__
    c_api.TF_GetCode(self.status.status))
tensorflow.python.framework.errors_impl.NotFoundError: Unsuccessful TensorSliceReader constructor: Failed to find any matching files for /home/bert/Dev/venv/uncased_L-12_H-768_A-12//bert_model.ckpt

It seems that TensorFlow failed to find the checkpoint file. As far as I know, though, a TensorFlow checkpoint "file" is actually three files, and passing the directory path plus the common prefix is the correct way to refer to it.
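To make the "prefix" idea concrete, here is a minimal sketch that lists the files on disk sharing a given checkpoint prefix. The helper name `checkpoint_files` is hypothetical, and it uses only the Python standard library, so it says nothing about how TensorFlow itself resolves the prefix.

```python
import glob
import os


def checkpoint_files(prefix):
    """Return sorted base names of files named `prefix` + "." + anything.

    Hypothetical helper for illustration only; it does not call TensorFlow.
    """
    # Escape the prefix so glob metacharacters in the path are taken literally.
    pattern = glob.escape(prefix) + ".*"
    return sorted(os.path.basename(p) for p in glob.glob(pattern))
```

Calling `checkpoint_files("/home/bert/Dev/venv/uncased_L-12_H-768_A-12/bert_model.ckpt")` on the directory shown in this question should return the `.data-00000-of-00001`, `.index`, and `.meta` files if they are where the prefix says they are.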

I believe the files are in the right place:

(venv) bert@bert-System-Product-Name:~/Dev/venv/uncased_L-12_H-768_A-12$ pwd
/home/bert/Dev/venv/uncased_L-12_H-768_A-12
(venv) bert@bert-System-Product-Name:~/Dev/venv/uncased_L-12_H-768_A-12$ ls
bert_config.json  bert_model.ckpt.data-00000-of-00001  bert_model.ckpt.index  bert_model.ckpt.meta  vocab.txt

I am running Ubuntu 16.04 LTS with an NVIDIA GTX 1080 Ti (CUDA 9.0), the Anaconda Python 3.5 distribution, and tensorflow-gpu 1.11.0 in a virtual environment.

I expected the code to run smoothly and start training (fine-tuning), since it is the official code and I placed the files as instructed.

I am answering my own question.

I solved the problem by simply removing the trailing slash (/) from $BERT_BASE_DIR, so the variable changed from '/home/bert/Dev/venv/uncased_L-12_H-768_A-12/' to '/home/bert/Dev/venv/uncased_L-12_H-768_A-12'.

As a result, the prefix no longer contains the double slash seen in "/home/bert/Dev/venv/uncased_L-12_H-768_A-12//bert_model.ckpt".

It seems that the checkpoint restore functions in TensorFlow treat a single slash and a double slash differently, even though, as far as I know, bash treats them as identical.
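If the path is built in Python rather than in the shell, the doubled slash can be avoided up front. The sketch below is only an illustration: `checkpoint_prefix` is a hypothetical helper, not part of the BERT repository or TensorFlow, and it assumes a local POSIX path (normalizing a URL-style path such as a `gs://` bucket this way would corrupt it).

```python
import os


def checkpoint_prefix(base_dir, name="bert_model.ckpt"):
    """Join `base_dir` and `name` without producing a doubled slash.

    Hypothetical helper; intended for local filesystem paths only.
    """
    # normpath collapses repeated separators and drops a trailing slash.
    return os.path.join(os.path.normpath(base_dir), name)


if __name__ == "__main__":
    # The trailing slash from the question no longer leaks into the prefix.
    print(checkpoint_prefix("/home/bert/Dev/venv/uncased_L-12_H-768_A-12/"))
```

The returned string can then be passed as the value of `--init_checkpoint`. With or without the trailing slash on the directory, the helper yields the same prefix.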
