
Tensorflow: Does sequence matter in encoding and decoding a TFRecord

I have some practice data that I want to encode to the TFRecord format and then decode back into features in Tensorflow. My question is very basic, but I could not find a clear answer to it.

Question: Do I need to decode the features in a dataset in the same sequence as they are encoded? In other words, I can't seem to find a way to reference features by field name in a TFRecord. This is really important for two reasons.

  1. I just want to get my assumption validated, so that I know how to avoid breaking my code in the future. Here is some simple code, though it is not a complete example.
  2. Python makes a big deal about dictionaries being unordered. So how can I guarantee sequence when I am using a data structure that is supposed to be unordered? I am not sure whether this is handled in some way that I don't know about.

To encode data into TFRecord format, you can do something like:

import pandas as pd
import tensorflow as tf

# Fields in DataFrame: ['DIVISION', 'SPORDER', 'PUMA', 'REGION']
df = pd.DataFrame(...)

with tf.python_io.TFRecordWriter('myfile.tfrecord') as writer:
    for row in df.itertuples():
        # `value` must be a list; BytesList needs bytes (assumes PUMA and REGION are str columns).
        example = tf.train.Example(features=tf.train.Features(feature={
            'feat/division': tf.train.Feature(int64_list=tf.train.Int64List(value=[row.DIVISION])),
            'label/sporder': tf.train.Feature(int64_list=tf.train.Int64List(value=[row.SPORDER])),
            'feat/puma': tf.train.Feature(bytes_list=tf.train.BytesList(value=[row.PUMA.encode('utf-8')])),
            'feat/region': tf.train.Feature(bytes_list=tf.train.BytesList(value=[row.REGION.encode('utf-8')])),
        }))
        writer.write(example.SerializeToString())

Then to ingest the dataset you would need something like the code below. Notice that the fields are referenced again in the same order. NOTE: I used the same dictionary keys in the TFRecords and in the decoded form, but I don't think that is necessary, just a convenience. I am not sure whether that is how it has to be.

def _parse_function(example_proto):
    # 'feat/division' was written as an int64 feature, so it is parsed as tf.int64.
    features = {'feat/division': tf.FixedLenFeature((), tf.int64, default_value=0),
                'label/sporder': tf.FixedLenFeature((), tf.int64, default_value=0),
                'feat/puma': tf.VarLenFeature(dtype=tf.string),
                'feat/region': tf.VarLenFeature(dtype=tf.string)}

    parsed_example = tf.parse_single_example(example_proto, features)
    # Split the label off from the remaining features.
    parsed_label = parsed_example.pop("label/sporder", None)

    return parsed_example, parsed_label

# The parse function has to be defined before it is mapped over the dataset.
dataset = tf.data.TFRecordDataset('myfile.tfrecord')
dataset = dataset.map(_parse_function)

The TFRecord format uses protobuf to serialize each record. You can think of it as a binary JSON/XML format. JSON, XML, and protobuf don't care about the order of the fields, so the order of the feature definitions is not important: features are looked up by name. The order is the same in your snippet only because that was convenient for reading.
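To check this for yourself, here is a minimal sketch that writes the features in one order and parses them with a spec listed in a different order. It assumes TensorFlow 2.x with eager execution, so it uses tf.io.parse_single_example and tf.io.FixedLenFeature rather than the TF 1.x names in the question. The helper make_example and the sample values are made up for illustration.

```python
import tensorflow as tf  # assumption: TensorFlow 2.x, eager execution enabled


def make_example(division, sporder, puma):
    """Hypothetical helper: build one Example with three named features."""
    return tf.train.Example(features=tf.train.Features(feature={
        'feat/division': tf.train.Feature(int64_list=tf.train.Int64List(value=[division])),
        'label/sporder': tf.train.Feature(int64_list=tf.train.Int64List(value=[sporder])),
        'feat/puma': tf.train.Feature(bytes_list=tf.train.BytesList(value=[puma])),
    }))


# Made-up sample values, serialized the same way the writer loop would do it.
serialized = make_example(3, 1, b'00100').SerializeToString()

# The parsing spec lists the same keys in a different order than they were written.
spec = {
    'feat/puma': tf.io.FixedLenFeature((), tf.string),
    'label/sporder': tf.io.FixedLenFeature((), tf.int64),
    'feat/division': tf.io.FixedLenFeature((), tf.int64),
}
parsed = tf.io.parse_single_example(serialized, spec)

# Each value is found by its key, regardless of position.
division = int(parsed['feat/division'].numpy())
sporder = int(parsed['label/sporder'].numpy())
puma = parsed['feat/puma'].numpy()
print(division, sporder, puma)
```

The keys themselves do matter, though: the parser finds each feature by the name it was written under, so a spec entry with a different name will not find the data. If you want different names downstream, rename the entries of the parsed dictionary after parsing.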

