
How can I pick specific records in TensorFlow from a .tfrecords file?

My goal is to train a neural net for a fixed number of epochs or steps. I would like each step to use a batch of data of a specific size from a .tfrecords file.

Currently I am reading from the file using this loop:

import numpy as np
import tensorflow as tf

i = 0
data = np.empty(shape=[x, y])

for serialized_example in tf.python_io.tf_record_iterator(filename):

    example = tf.train.Example()
    example.ParseFromString(serialized_example)

    Labels = example.features.feature['Labels'].bytes_list.value
    # Some more features here

    data[i-1] = [Labels[0]]  # more features here

    if i == 3:
        break
    i = i + 1

print(data)  # do some stuff etc.

I am a bit of a Python noob, and I suspect that creating "i" outside the loop and breaking out when it reaches a certain value is just a hacky workaround.

Is there a way that I can read data from the file but specify "I would like the first 100 values in the bytes_list that is contained within the Labels feature" and then subsequently "I would like the next 100 values"?

To clarify, the thing that I am unfamiliar with is looping over a file in this manner; I am not really certain how to manipulate the loop.

Thanks.

Impossible. TFRecords is a streaming reader and has no random access.

A TFRecords file represents a sequence of (binary) strings. The format is not random access, so it is suitable for streaming large amounts of data but not suitable if fast sharding or other non-sequential access is desired.

Expanding on the comment by Shan Carter (although it's not an ideal solution for your question) for archival purposes.

If you'd like to use enumerate() to break out from a loop at a certain iteration, you could do the following:

n = 5  # Iteration you would like to stop at
data = np.empty(shape=[x, y])

for i, serialized_example in enumerate(tf.python_io.tf_record_iterator(filename)):

    example = tf.train.Example()
    example.ParseFromString(serialized_example)

    Labels = example.features.feature['Labels'].bytes_list.value
    # Some more features here

    data[i-1] = [Labels[0], Labels[1]]  # more features here

    if i == n:
        break

print(data)

Addressing your use case for .tfrecords

I would like each step to use a batch of data of a specific size from a .tfrecords file.

As mentioned by TimZaman, .tfrecords are not meant for arbitrary access of data. But seeing as you just need to continuously pull batches from the .tfrecords file, you might be better off using the tf.data API to feed your model.

Adapted from the tf.data guide:

Constructing a Dataset from .tfrecord files

import tensorflow as tf

filepath1 = '/path/to/file.tfrecord'
filepath2 = '/path/to/another_file.tfrecord'
dataset = tf.data.TFRecordDataset(filenames=[filepath1, filepath2])
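
Since each training step should see a batch of a specific size, the dataset can be parsed and batched before it is handed to the model. The sketch below is only illustrative: parse_fn and the feature spec are assumptions (a single bytes feature named 'Labels', as in the question) and must match how the records were actually written:

# A minimal parsing/batching sketch -- the feature spec here is an assumption
# and must mirror the features that were serialized into the .tfrecord file.
def parse_fn(serialized_example):
    feature_spec = {'Labels': tf.io.FixedLenFeature([], tf.string)}
    return tf.io.parse_single_example(serialized_example, feature_spec)

batch_size = 100  # e.g. 100 records per step, then the next 100, and so on
dataset = dataset.map(parse_fn).batch(batch_size)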

From here, if you're using the tf.keras API, you could pass dataset as an argument to model.fit like so:

model.fit(x = dataset,
          batch_size = None,
          validation_data = some_other_dataset)
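
And since the original goal was to train for a fixed number of epochs or steps, one option (a sketch, assuming model is already compiled and dataset is batched as above) is to repeat the dataset and cap training with epochs and steps_per_epoch:

# Sketch only: `model` is assumed to be a compiled tf.keras model and
# `dataset` the batched dataset built above.
train_ds = dataset.repeat()       # cycle through the .tfrecords file indefinitely
model.fit(x = train_ds,
          epochs = 10,            # fixed number of epochs
          steps_per_epoch = 500)  # fixed number of steps per epoch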

Extra Stuff

Here's a blog which helps to explain .tfrecord files a little better than the TensorFlow documentation.
