Reading data from a bucket in Google ml-engine (tensorflow)

Question

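I'm having trouble reading data from a bucket hosted by Google. I have a bucket containing about 1000 files that I need to access, held at (for example) gs://my-bucket/data

Using gsutil from the command line, or one of Google's other Python API clients, I can access the data in the bucket. However, importing these APIs is not supported by default on google-cloud-ml-engine.

I need a way to access both the data and the names of the files, either with a default Python library (i.e. os) or with tensorflow. I know tensorflow has this functionality built in somewhere, but it has been hard for me to find.

Ideally I'm looking for a replacement for one command such as os.listdir() and another for open():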

train_data = [read_training_data(filename) for filename in os.listdir('gs://my-bucket/data/')]

where read_training_data uses a tensorflow reader object.

Thanks for any help! (Also, P.S.: my data is binary.)

Answer 1

If you just want to read the data into memory, a related answer has the details you need, namely to use the file_io module.
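As a minimal sketch of that approach (not the answerer's own code), the helper below lists a directory and reads every file as bytes. The names read_all and read_all_from_gcs are hypothetical. The sketch assumes the directory contains only files, and that TensorFlow's file_io module is importable in the training environment; the listing and opening functions are passed in so the same helper also works on a local directory with os.listdir and open.

```python
import os


def read_all(data_dir, list_fn, open_fn):
    """Return {filename: bytes} for every entry listed under data_dir.

    list_fn and open_fn are injected so the same code can run against a
    local directory or a gs:// path.
    """
    contents = {}
    for name in sorted(list_fn(data_dir)):
        # gs:// paths always use forward slashes, so join by hand.
        path = data_dir.rstrip('/') + '/' + name
        with open_fn(path) as f:
            contents[name] = f.read()
    return contents


def read_all_from_gcs(data_dir):
    """Read every file under a gs:// directory via TensorFlow's file_io."""
    # Imported lazily so read_all stays usable without TensorFlow installed.
    from tensorflow.python.lib.io import file_io
    return read_all(
        data_dir,
        list_fn=file_io.list_directory,
        # 'rb' because the data in the question is binary.
        open_fn=lambda path: file_io.FileIO(path, mode='rb'),
    )
```

With the bucket from the question, the call would be read_all_from_gcs('gs://my-bucket/data'). Here file_io.list_directory stands in for os.listdir and file_io.FileIO for open.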

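That said, you might want to consider using TensorFlow's built-in reading mechanisms, as they can be more performant.

Information on reading data can be found in the TensorFlow documentation. The latest and greatest (but not yet part of official "core" TensorFlow) is the Dataset API.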

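Some things to keep in mind:

Are you using a format TensorFlow can read? Can it be converted to that format?
Is the overhead of "feeding" high enough to affect training performance?
Is the training set too big to fit in memory?

If the answer to one or more of these questions is yes, especially the latter two, consider using readers.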
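For illustration only, here is a rough sketch of what a reader-style input pipeline might look like with the Dataset API. It assumes a TensorFlow version where the API lives under tf.data and tf.io.read_file is available (roughly 1.13 or later, or 2.x), which is newer than the release this answer describes. The function name make_file_dataset and the batch size are hypothetical.

```python
def make_file_dataset(file_pattern, batch_size=32):
    """Build a tf.data pipeline that yields batches of raw file contents."""
    # Lazy import; assumes TensorFlow >= 1.13 or 2.x is installed.
    import tensorflow as tf

    # list_files expands a glob pattern; gs:// patterns are accepted too.
    filenames = tf.data.Dataset.list_files(file_pattern, shuffle=False)
    # read_file returns each file's bytes as a scalar string tensor,
    # so files are loaded as the pipeline is consumed, not all up front.
    contents = filenames.map(tf.io.read_file)
    return contents.batch(batch_size)
```

A call such as make_file_dataset('gs://my-bucket/data/*') would stream the files from the question's bucket in batches. Decoding the bytes into tensors would be added as a further map step specific to the binary format.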

Answer 2

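FWIW, I also had trouble reading files, in particular binary files stored in Google Cloud Storage, from inside a Datalab notebook. The first way I managed to do it was to copy the files to my local filesystem with gsutil and then read them normally with tensorflow. That is demonstrated here, after the file copy was done.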
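The copy step itself is not shown in this answer. A minimal sketch of how it might be scripted from Python follows, assuming the gsutil command-line tool is installed and authenticated in the notebook environment. The helper names build_gsutil_cp_command and copy_from_bucket are hypothetical.

```python
import subprocess


def build_gsutil_cp_command(src, dst):
    """Return the argv list for a parallel, recursive gsutil copy."""
    # -m runs transfers in parallel; cp -r copies everything under src.
    return ['gsutil', '-m', 'cp', '-r', src, dst]


def copy_from_bucket(src, dst):
    """Copy src (a gs:// URL) to the local path dst; raises on failure."""
    subprocess.check_call(build_gsutil_cp_command(src, dst))
```

For example, copy_from_bucket('gs://my-bucket/data', '.') would place the files in a local data directory, after which they can be read with ordinary local paths.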

Here is my setup cell:

import math
import shutil
import numpy as np
import pandas as pd
import tensorflow as tf

tf.logging.set_verbosity(tf.logging.INFO)
pd.options.display.max_rows = 10
pd.options.display.float_format = '{:.1f}'.format

Here is a cell that reads the file locally as a sanity check.

# this works for reading local file
audio_binary_local = tf.read_file("100852.mp3")
waveform = tf.contrib.ffmpeg.decode_audio(
    audio_binary_local, file_format='mp3',
    samples_per_second=44100, channel_count=2)
# this will show that it has two channels of data
with tf.Session() as sess:
    result = sess.run(waveform)
    print(result)

Here is reading the file directly from gs: as a binary file.

# this works for remote files in gs:
gsfilename = 'gs://proj-getting-started/UrbanSound/data/air_conditioner/100852.mp3'
# python 2
#audio_binary_remote = tf.gfile.Open(gsfilename).read()
# python 3
audio_binary_remote = tf.gfile.Open(gsfilename, 'rb').read()
waveform = tf.contrib.ffmpeg.decode_audio(
    audio_binary_remote, file_format='mp3',
    samples_per_second=44100, channel_count=2)
# this will show that it has two channels of data
with tf.Session() as sess:
    result = sess.run(waveform)
    print(result)

2 answers

Answer 1
3 2017-09-20 01:58:11

Answer 2
1 2018-07-09 16:08:38
