What should I do if I want to use a large dataset with TensorFlow that cannot be loaded into memory?

Question

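I want to train a TensorFlow model on a large dataset that cannot be loaded into memory, but I don't know exactly what I should do.

I have read some great posts about the TFRecords file format, as well as the official documentation, but I still can't figure it out.

Does TensorFlow offer a complete solution for this?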

Answer 1

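Consider using tf.TextLineReader. Together with tf.train.string_input_producer, it lets you load data from multiple files on disk (useful if your dataset is large enough that it has to be spread across several files).

See https://www.tensorflow.org/programmers_guide/reading_data#reading_from_files

Code snippet from the link above: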

import tensorflow as tf  # TensorFlow 1.x queue-based input API

filename_queue = tf.train.string_input_producer(["file0.csv", "file1.csv"])

reader = tf.TextLineReader()
key, value = reader.read(filename_queue)

# Default values, in case of empty columns. Also specifies the type of the
# decoded result.
record_defaults = [[1], [1], [1], [1], [1]]
col1, col2, col3, col4, col5 = tf.decode_csv(
    value, record_defaults=record_defaults)
features = tf.stack([col1, col2, col3, col4])

with tf.Session() as sess:
  # Start populating the filename queue.
  coord = tf.train.Coordinator()
  threads = tf.train.start_queue_runners(coord=coord)

  for i in range(1200):
    # Retrieve a single instance:
    example, label = sess.run([features, col5])

  coord.request_stop()
  coord.join(threads)
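To make the idea behind the snippet concrete, here is a minimal pure-Python sketch of the same streaming pattern. It is not TensorFlow code and not how the queue runners are implemented. The function name stream_records, the constant RECORD_DEFAULTS, and the demo file contents are hypothetical, and it assumes every row has five integer columns, as in the snippet above.

```python
import csv
import os
import tempfile

# Mirrors record_defaults in the snippet above: one default per column.
RECORD_DEFAULTS = [1, 1, 1, 1, 1]


def stream_records(filenames, defaults=RECORD_DEFAULTS):
    """Yield (features, label) one row at a time across several CSV files."""
    for name in filenames:
        with open(name, newline="") as f:
            for row in csv.reader(f):
                if not row:
                    continue  # skip blank lines
                # An empty column falls back to its default value.
                values = [int(cell) if cell.strip() else default
                          for cell, default in zip(row, defaults)]
                yield values[:4], values[4]


if __name__ == "__main__":
    # Two tiny hypothetical files standing in for file0.csv and file1.csv.
    tmpdir = tempfile.mkdtemp()
    paths = []
    for i, text in enumerate(["1,2,3,4,0\n5,,7,8,1\n", "9,10,11,12,0\n"]):
        path = os.path.join(tmpdir, "file%d.csv" % i)
        with open(path, "w", newline="") as f:
            f.write(text)
        paths.append(path)

    for features, label in stream_records(paths):
        print(features, label)
```

Because stream_records is a generator, only the current line is held in memory, so the total size of the files does not matter.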

Answer 2

Normally you would still train in batches, so you can load the data on the fly. For example, with images:

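for bid in range(nrBatches):  # nrBatches: total number of batches
    # load_data_from_hd is your own function; it reads only batch `bid` from disk
    batch_x, batch_y = load_data_from_hd(bid)
    train_step.run(feed_dict={x: batch_x, y_: batch_y})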

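This way you load each batch on the fly and only hold the data you need at any given moment. Of course, training takes longer when the data is read from the hard disk rather than from memory.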
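One possible shape for a load_data_from_hd helper is sketched below, using only the Python standard library. The file layout is an assumption made for illustration: a flat binary file of float32 records, each holding four features followed by one label. The extra path and batch_size parameters, the constants, and the write_demo_file helper are hypothetical as well. The point is that the function seeks straight to the requested batch and reads only those bytes.

```python
import array
import os
import tempfile

N_FEATURES = 4                # hypothetical: four feature values per example
RECORD_LEN = N_FEATURES + 1   # features plus one label, all float32
ITEM_SIZE = array.array("f").itemsize


def write_demo_file(path, n_records):
    """Write records [i, i, i, i, 10*i] as raw float32 values."""
    with open(path, "wb") as f:
        for i in range(n_records):
            array.array("f", [i] * N_FEATURES + [10 * i]).tofile(f)


def load_data_from_hd(path, bid, batch_size):
    """Read only batch number `bid` from the file, leaving the rest on disk."""
    record_bytes = RECORD_LEN * ITEM_SIZE
    with open(path, "rb") as f:
        f.seek(bid * batch_size * record_bytes)   # jump to the batch start
        raw = f.read(batch_size * record_bytes)   # last batch may be shorter
    values = array.array("f")
    values.frombytes(raw)
    rows = [values[i:i + RECORD_LEN].tolist()
            for i in range(0, len(values), RECORD_LEN)]
    batch_x = [row[:N_FEATURES] for row in rows]
    batch_y = [row[N_FEATURES] for row in rows]
    return batch_x, batch_y


if __name__ == "__main__":
    fd, path = tempfile.mkstemp(suffix=".bin")
    os.close(fd)
    write_demo_file(path, n_records=10)

    batch_size = 4
    n_batches = (10 + batch_size - 1) // batch_size
    for bid in range(n_batches):
        batch_x, batch_y = load_data_from_hd(path, bid, batch_size)
        print(bid, len(batch_x), batch_y)
    os.remove(path)
```

A fixed-size record layout is what makes the seek possible. With variable-length records such as CSV lines you would need an index of byte offsets, or you would have to read sequentially instead.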
