What should I do if I want to use large datasets that can't fit into memory with TensorFlow?
I want to train a model with TensorFlow on a large dataset that cannot be loaded into memory all at once, but I don't know exactly what I should do.
I have read some great posts about the TFRecords file format and the official documentation, but I still can't figure it out.
Is there a complete solution plan with TensorFlow?
Consider using tf.TextLineReader, which in conjunction with tf.train.string_input_producer allows you to load data from multiple files on disk (if your dataset is large enough that it needs to be spread out into multiple files).
See https://www.tensorflow.org/programmers_guide/reading_data#reading_from_files
Code snippet from the link above:
filename_queue = tf.train.string_input_producer(["file0.csv", "file1.csv"])

reader = tf.TextLineReader()
key, value = reader.read(filename_queue)

# Default values, in case of empty columns. Also specifies the type of the
# decoded result.
record_defaults = [[1], [1], [1], [1], [1]]
col1, col2, col3, col4, col5 = tf.decode_csv(
    value, record_defaults=record_defaults)
features = tf.stack([col1, col2, col3, col4])

with tf.Session() as sess:
    # Start populating the filename queue.
    coord = tf.train.Coordinator()
    threads = tf.train.start_queue_runners(coord=coord)

    for i in range(1200):
        # Retrieve a single instance:
        example, label = sess.run([features, col5])

    coord.request_stop()
    coord.join(threads)
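If your data starts out as one huge CSV, you first need to split it into the multiple shard files that tf.train.string_input_producer expects. A minimal, framework-agnostic sketch (the function name and shard size are made up for illustration):

```python
import csv
import os

def split_csv_into_shards(src_path, out_dir, rows_per_shard):
    """Split one large CSV into numbered shard files holding at most
    `rows_per_shard` rows each; returns the list of shard paths."""
    shard_paths = []
    shard, writer = None, None
    with open(src_path, newline="") as src:
        for i, row in enumerate(csv.reader(src)):
            if i % rows_per_shard == 0:
                # Close the previous shard and start a new one.
                if shard:
                    shard.close()
                path = os.path.join(out_dir, "file%d.csv" % len(shard_paths))
                shard = open(path, "w", newline="")
                writer = csv.writer(shard)
                shard_paths.append(path)
            writer.writerow(row)
    if shard:
        shard.close()
    return shard_paths
```

The returned list can then be passed straight to tf.train.string_input_producer in place of the hard-coded ["file0.csv", "file1.csv"].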
Normally you use batch-wise training anyway, so you can load the data on the fly. For example, for images:
for bid in nrBatches:
    batch_x, batch_y = load_data_from_hd(bid)
    train_step.run(feed_dict={x: batch_x, y_: batch_y})
So you load every batch on the fly, and only the data you need at any given moment is held in memory. Naturally, your training time will increase when loading data from the hard disk instead of memory.
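The load_data_from_hd call above is a placeholder; one common way to implement it is to save each batch during preprocessing as a pair of .npy files and load them on demand. A minimal sketch under that assumption (the directory layout and file names are illustrative, not part of any TensorFlow API):

```python
import os
import numpy as np

def load_data_from_hd(batch_dir, bid):
    """Load one pre-saved batch (features and labels) from disk.

    Assumes batches were written during preprocessing as
    batch_<id>_x.npy / batch_<id>_y.npy files via np.save.
    """
    batch_x = np.load(os.path.join(batch_dir, "batch_%d_x.npy" % bid))
    batch_y = np.load(os.path.join(batch_dir, "batch_%d_y.npy" % bid))
    return batch_x, batch_y
```

With this layout, only the current batch is resident in memory at any point in the training loop.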