Tensorflow 1.10 TFRecordDataset - recovering TFRecords

Notes:

  1. This question extends a previous question of mine. In that question I ask about the best way to store some dummy data as Example and SequenceExample, seeking to know which is better for data similar to the dummy data provided. I provide explicit formulations of both the Example and SequenceExample constructions as well as, in the answers, a programmatic way to do so.

  2. Because this is still a lot of code, I am providing a Colab (an interactive Jupyter notebook hosted by Google) file where you can try the code out yourself. All the necessary code is there and it is generously commented.

I am trying to learn how to convert my data into TFRecords, as the claimed benefits seem worthwhile for my data. However, the documentation leaves a lot to be desired, and the tutorials / blogs I have seen that try to go deeper really only scratch the surface or rehash the sparse docs that already exist.

For the demo data considered in my previous question - as well as here - I have written a decent class that takes:

  • a sequence with n channels (in this example integer-based, of fixed length and with n channels)
  • soft-labeled class probabilities (in this example there are n classes and they are float-based)
  • some meta data (in this example a string and two floats)

and can encode the data in 1 of 6 forms (a sketch of two of these forms follows the list):

  1. Example, with sequence channels / classes separate in a numeric type ( int64 in this case) with meta data tacked on
  2. Example, with sequence channels / classes separate as a byte string (via numpy.ndarray.tostring() ) with meta data tacked on
  3. Example, with sequence / classes dumped as byte string with meta data tacked on
  4. SequenceExample, with sequence channels / classes separate in a numeric type and meta data as context
  5. SequenceExample, with sequence channels separate as a byte string and meta data as context
  6. SequenceExample, with sequence and classes dumped as byte string and meta data as context
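To make the forms concrete, here is a minimal sketch of how forms 1 and 4 might be built for this dummy data. The helper functions, the sequence length of 5, and the specific dummy values are my own illustrative assumptions, not the exact code from the Colab:

import numpy as np
import tensorflow as tf

# Hypothetical dummy data matching the description above: a fixed-length
# sequence with 3 channels, 3 soft class probabilities, and meta data.
sequence = np.random.randint(0, 10, size=(5, 3))  # (timesteps, channels)
pclasses = np.random.dirichlet(np.ones(3))        # soft labels, sum to 1
name, val_1, val_2 = b'dummy_0', 0.1, 0.2

def int64_feature(values):
  return tf.train.Feature(int64_list=tf.train.Int64List(value=values))

def float_feature(values):
  return tf.train.Feature(float_list=tf.train.FloatList(value=values))

def bytes_feature(values):
  return tf.train.Feature(bytes_list=tf.train.BytesList(value=values))

# Form 1: a flat Example, with the sequence flattened into a single
# int64 feature and the meta data tacked on alongside it.
example = tf.train.Example(features=tf.train.Features(feature={
    'sequence': int64_feature(sequence.flatten().tolist()),
    'pclasses': float_feature(pclasses.tolist()),
    'Name'    : bytes_feature([name]),
    'Val_1'   : float_feature([val_1]),
    'Val_2'   : float_feature([val_2]),
}))

# Form 4: a SequenceExample, with one Feature per timestep inside
# (unnamed) FeatureLists and the meta data stored as context.
sequence_example = tf.train.SequenceExample(
    context=tf.train.Features(feature={
        'Name' : bytes_feature([name]),
        'Val_1': float_feature([val_1]),
        'Val_2': float_feature([val_2]),
    }),
    feature_lists=tf.train.FeatureLists(feature_list={
        'sequence': tf.train.FeatureList(
            feature=[int64_feature(step) for step in sequence.tolist()]),
        'pclasses': tf.train.FeatureList(
            feature=[float_feature(pclasses.tolist())]),
    }))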

This works fine.

In the Colab I show how to write dummy data all in the same file as well as in separate files.
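For reference, the writing step can look like the following minimal sketch. It assumes a hypothetical helper make_sequence_example(i) that builds one tf.train.SequenceExample per dummy record (e.g. along the lines of the sketch above); the file names match the ones used when reading the records back below:

import os
import tensorflow as tf

# All records in one file. `make_sequence_example` is a hypothetical
# helper building one tf.train.SequenceExample per dummy record.
with tf.python_io.TFRecordWriter('dummy_sequences_all.tfrecords') as writer:
  for i in range(9):
    writer.write(make_sequence_example(i).SerializeToString())

# Or split across separate files, matching the filename pattern
# used when reading the records back.
for i in range(3):
  path = os.path.join(os.getcwd(), 'dummy_sequences_{}.tfrecords'.format(i))
  with tf.python_io.TFRecordWriter(path) as writer:
    for j in range(3):
      writer.write(make_sequence_example(3 * i + j).SerializeToString())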

My question is: how can I recover this data?

I have made 4 attempts at doing so in the linked file.

Why is TFReader under a different sub-package from TFWriter?

Solved by updating the features to include shape information and remembering that the FeatureLists of a SequenceExample are unnamed.

import os
import tensorflow as tf

# Context (meta data) features: the name string and the two floats.
context_features = {
    'Name' : tf.FixedLenFeature([], dtype=tf.string),
    'Val_1': tf.FixedLenFeature([], dtype=tf.float32),
    'Val_2': tf.FixedLenFeature([], dtype=tf.float32)
}

# Per-timestep features. Note the explicit (3,) shape for the
# 3 channels / 3 classes -- omitting it was part of the problem.
sequence_features = {
    'sequence': tf.FixedLenSequenceFeature((3,), dtype=tf.int64),
    'pclasses': tf.FixedLenSequenceFeature((3,), dtype=tf.float32),
}

def parse(record):
  # Returns a (context, sequence) tuple of dicts of tensors.
  parsed = tf.parse_single_sequence_example(
        record,
        context_features=context_features,
        sequence_features=sequence_features
  )
  return parsed


filenames = [os.path.join(os.getcwd(), f"dummy_sequences_{i}.tfrecords")
             for i in range(3)]
dataset = tf.data.TFRecordDataset(filenames).map(parse)

iterator = tf.data.Iterator.from_structure(dataset.output_types,
                                           dataset.output_shapes)
next_element = iterator.get_next()

training_init_op = iterator.make_initializer(dataset)

with tf.Session() as sess:
  for _ in range(2):
    # Initialize the iterator over the training dataset for each pass.
    sess.run(training_init_op)
    for _ in range(3):
      ne = sess.run(next_element)
      print(ne)
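The same shape fix applies to the plain Example variants (forms 1 - 3). A minimal sketch for form 1, assuming a sequence of 5 timesteps with 3 channels flattened into a single int64 feature (the length 5 is my assumption for illustration):

example_features = {
    'sequence': tf.FixedLenFeature([5 * 3], dtype=tf.int64),
    'pclasses': tf.FixedLenFeature([3], dtype=tf.float32),
    'Name'    : tf.FixedLenFeature([], dtype=tf.string),
    'Val_1'   : tf.FixedLenFeature([], dtype=tf.float32),
    'Val_2'   : tf.FixedLenFeature([], dtype=tf.float32),
}

def parse_example(record):
  parsed = tf.parse_single_example(record, features=example_features)
  # Restore the (timesteps, channels) shape lost by flattening.
  parsed['sequence'] = tf.reshape(parsed['sequence'], (5, 3))
  return parsed

example_dataset = tf.data.TFRecordDataset(filenames).map(parse_example)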
