How can I pick specific records in TensorFlow from a .tfrecords file?
My goal is to train a neural net for a fixed number of epochs or steps, and I would like each step to use a batch of data of a specific size from a .tfrecords file.
Currently I am reading from the file using this loop:
i = 0
data = np.empty(shape=[x, y])
for serialized_example in tf.python_io.tf_record_iterator(filename):
    example = tf.train.Example()
    example.ParseFromString(serialized_example)
    Labels = example.features.feature['Labels'].byte_list.value
    # Some more features here
    data[i-1] = [Labels[0]]  # more features here
    if i == 3:
        break
    i = i + 1
print data  # do some stuff etc.
I am a bit of a Python noob, and I suspect that creating "i" outside the loop and breaking out when it reaches a certain value is just a hacky work-around.
Is there a way that I can read data from the file but specify "I would like the first 100 values in the byte_list that is contained within the Labels feature" and then subsequently "I would like the next 100 values"?
To clarify, the thing that I am unfamiliar with is looping over a file in this manner; I am not really certain how to manipulate the loop.
Thanks.
Impossible. TFRecords is a streaming reader and has no random access.
A TFRecords file represents a sequence of (binary) strings. The format is not random access, so it is suitable for streaming large amounts of data but not suitable if fast sharding or other non-sequential access is desired.
If you'd like to use enumerate() to break out from a loop at a certain iteration, you could do the following:
n = 5  # Iteration you would like to stop at
data = np.empty(shape=[x, y])
for i, serialized_example in enumerate(tf.python_io.tf_record_iterator(filename)):
    example = tf.train.Example()
    example.ParseFromString(serialized_example)
    Labels = example.features.feature['Labels'].byte_list.value
    # Some more features here
    data[i] = [Labels[0], Labels[1]]  # more features here; enumerate starts at 0
    if i == n:
        break
print(data)
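If you need consecutive chunks rather than a single prefix, itertools.islice on the record iterator gives you "the first 100, then the next 100" directly, because islice consumes the iterator and each call resumes where the previous one stopped. This sketch uses a plain range as a stand-in for tf.python_io.tf_record_iterator(filename); any iterator of serialized examples behaves the same way:

```python
import itertools

# Stand-in for tf.python_io.tf_record_iterator(filename).
record_iterator = iter(range(250))

# Each islice call picks up where the previous one left off.
first_100 = list(itertools.islice(record_iterator, 100))  # records 0..99
next_100 = list(itertools.islice(record_iterator, 100))   # records 100..199

print(first_100[0], first_100[-1])  # 0 99
print(next_100[0], next_100[-1])    # 100 199
```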
Addressing your .tfrecords use case: "I would like each step to use a batch of data of a specific size from a .tfrecords file."
As mentioned by TimZaman, .tfrecords are not meant for arbitrary access of data. But seeing as you just need to continuously pull batches from the .tfrecords file, you might be better off using the tf.data API to feed your model.
Adapted from the tf.data guide:
Constructing a Dataset from .tfrecord files
filepath1 = '/path/to/file.tfrecord'
filepath2 = '/path/to/another_file.tfrecord'
dataset = tf.data.TFRecordDataset(filenames = [filepath1, filepath2])
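To get a batch of a specific size at each training step, you can chain .batch onto the dataset, e.g. dataset = dataset.batch(100). Conceptually, batching just groups consecutive records from the stream; the plain-Python sketch below shows that grouping (batched here is a hypothetical helper for illustration, not part of the tf.data API — in TensorFlow, Dataset.batch does this for you):

```python
def batched(iterable, batch_size):
    """Group consecutive elements into lists of batch_size, mirroring
    what tf.data's Dataset.batch does to a stream of records."""
    batch = []
    for item in iterable:
        batch.append(item)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        # Final partial batch; tf.data keeps it too unless drop_remainder=True.
        yield batch

print(list(batched(range(10), 4)))  # [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
```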
From here, if you're using the tf.keras API, you could pass dataset as an argument into model.fit like so:
model.fit(x = dataset,
          batch_size = None,
          validation_data = some_other_dataset)
Here's a blog which helps to explain .tfrecord files a little better than the tensorflow documentation.