
How to use the recover_last_checkpoints method of tf.train.Saver()?

The documentation says that a list of checkpoint paths should be passed to it, but how do I get that list? By hard-coding? No, that's a silly practice. By parsing the protocol buffer file (the file named checkpoint in your model directory)? But TensorFlow does not implement a parser for it, does it? So do I have to implement one myself? Is there a good practice for getting the list of checkpoint paths?

I raise this question because these days I have been troubled by one thing. As you know, a days-long training run may crash for some reason, and I have to recover it from the latest checkpoint. Recovering training is easy, since I just need to write the following code:

restorer = tf.train.Saver()
restorer.restore(sess, latest_checkpoint)

I can hard-code latest_checkpoint or, somewhat wiser, use tf.train.latest_checkpoint().
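For concreteness, a minimal sketch of that recovery step, assuming a placeholder directory '/path/to/model_dir' and a graph that has already been rebuilt:

import tensorflow as tf

# Placeholder path; tf.train.latest_checkpoint() reads the `checkpoint`
# protobuf file in this directory and returns the newest checkpoint prefix.
latest_checkpoint = tf.train.latest_checkpoint('/path/to/model_dir')

restorer = tf.train.Saver()
with tf.Session() as sess:
    restorer.restore(sess, latest_checkpoint)
    # ... resume training from here ...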

However, a problem arises after I recover the training: the old checkpoint files created before the crash are left behind. The Saver only manages the checkpoint files created in one run. I wish it could also manage the previously created checkpoint files, so that they would be deleted automatically and I would not have to delete them by hand every time. Such repetitive work is really silly.

Then I found the recover_last_checkpoints method of the tf.train.Saver class, which lets the Saver manage old checkpoints, but it is not handy to use. So is there any good solution?

As mentioned by @isarandi in a comment, the easiest way is to first recover all checkpoint paths using get_checkpoint_state followed by all_model_checkpoint_paths, which is basically an undocumented feature. You can then restore your latest state as such:

states = tf.train.get_checkpoint_state(your_checkpoint_dir)  # parses the `checkpoint` protobuf file
checkpoint_paths = states.all_model_checkpoint_paths         # every checkpoint prefix recorded in that file
saver.recover_last_checkpoints(checkpoint_paths)             # the Saver now tracks (and rotates out) the old files
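Putting it together, a minimal sketch of resuming training while letting the Saver rotate out the pre-crash checkpoints might look like this (your_checkpoint_dir is assumed to be your training directory, and the graph is assumed to be rebuilt already):

import tensorflow as tf

saver = tf.train.Saver(max_to_keep=5)

# Recover the full checkpoint history recorded before the crash.
states = tf.train.get_checkpoint_state(your_checkpoint_dir)
saver.recover_last_checkpoints(states.all_model_checkpoint_paths)

with tf.Session() as sess:
    # Resume from the newest checkpoint; the older ones are now tracked by
    # the Saver and will be deleted by later saver.save() calls as usual.
    saver.restore(sess, states.model_checkpoint_path)
    # ... continue training and keep calling saver.save(sess, ...) ...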
