What should I do if I want to use large datasets that can't fit into memory with TensorFlow?
I want to train a model with TensorFlow on a large dataset that cannot be loaded into memory all at once, but I don't know exactly what I should do.
I have read some great posts about the TFRecords file format and the official documentation, but I still can't figure it out.
Is there a complete solution plan with TensorFlow?
Consider using tf.TextLineReader, which in conjunction with tf.train.string_input_producer allows you to load data from multiple files on disk (if your dataset is large enough that it needs to be spread out into multiple files).
See https://www.tensorflow.org/programmers_guide/reading_data#reading_from_files
Code snippet from the link above:
filename_queue = tf.train.string_input_producer(["file0.csv", "file1.csv"])

reader = tf.TextLineReader()
key, value = reader.read(filename_queue)

# Default values, in case of empty columns. Also specifies the type of the
# decoded result.
record_defaults = [[1], [1], [1], [1], [1]]
col1, col2, col3, col4, col5 = tf.decode_csv(
    value, record_defaults=record_defaults)
features = tf.stack([col1, col2, col3, col4])

with tf.Session() as sess:
    # Start populating the filename queue.
    coord = tf.train.Coordinator()
    threads = tf.train.start_queue_runners(coord=coord)

    for i in range(1200):
        # Retrieve a single instance:
        example, label = sess.run([features, col5])

    coord.request_stop()
    coord.join(threads)
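If your data starts out as one huge CSV, you first need to split it into the multiple shard files that tf.train.string_input_producer expects. A minimal, framework-agnostic sketch (the function name and shard size are made up for illustration):

```python
import csv
import os

def split_csv_into_shards(src_path, out_dir, rows_per_shard):
    """Split one large CSV into numbered shard files holding at most
    `rows_per_shard` rows each; returns the list of shard paths."""
    shard_paths = []
    shard, writer = None, None
    with open(src_path, newline="") as src:
        for i, row in enumerate(csv.reader(src)):
            if i % rows_per_shard == 0:
                # Close the previous shard and start a new one.
                if shard:
                    shard.close()
                path = os.path.join(out_dir, "file%d.csv" % len(shard_paths))
                shard = open(path, "w", newline="")
                writer = csv.writer(shard)
                shard_paths.append(path)
            writer.writerow(row)
    if shard:
        shard.close()
    return shard_paths
```

The returned list can then be passed straight to tf.train.string_input_producer in place of the hard-coded ["file0.csv", "file1.csv"].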
Normally you use batch-wise training anyway, so you can load the data on the fly. For example, for images:
for bid in nrBatches:
    batch_x, batch_y = load_data_from_hd(bid)
    train_step.run(feed_dict={x: batch_x, y_: batch_y})
So you load every batch on the fly, and only the data you need at any given moment is held in memory. Naturally, your training time will increase when loading data from the hard disk instead of memory.
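The load_data_from_hd call above is a placeholder; one common way to implement it is to save each batch during preprocessing as a pair of .npy files and load them on demand. A minimal sketch under that assumption (the directory layout and file names are illustrative, not part of any TensorFlow API):

```python
import os
import numpy as np

def load_data_from_hd(batch_dir, bid):
    """Load one pre-saved batch (features and labels) from disk.

    Assumes batches were written during preprocessing as
    batch_<id>_x.npy / batch_<id>_y.npy files via np.save.
    """
    batch_x = np.load(os.path.join(batch_dir, "batch_%d_x.npy" % bid))
    batch_y = np.load(os.path.join(batch_dir, "batch_%d_y.npy" % bid))
    return batch_x, batch_y
```

With this layout, only the current batch is resident in memory at any point in the training loop.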