
Reading data from bucket in Google ml-engine (tensorflow)

I am having issues reading data from a bucket hosted by Google. I have a bucket containing ~1000 files I need to access, held at (for example) gs://my-bucket/data.

Using gsutil from the command line, or one of Google's other Python API clients, I can access the data in the bucket; however, importing these APIs is not supported by default on google-cloud-ml-engine.

I need a way to access both the data and the names of the files, either with a default Python library (i.e. os) or using TensorFlow. I know TensorFlow has this functionality built in somewhere, but it has been hard for me to find.

Ideally I am looking for two replacements, one for os.listdir() and another for open(), for example:

train_data = [read_training_data(filename) for filename in os.listdir('gs://my-bucket/data/')]

where read_training_data uses a TensorFlow reader object.

Thanks for any help! (P.S. My data is binary.)

If you just want to read data into memory, then this answer has the details you need, namely, to use the file_io module.
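To make that concrete, here is a minimal sketch of the two replacements the question asks for. It is not taken from the linked answer. The helper name read_bucket_files is hypothetical, and the listing and opening functions are passed in as parameters so the helper can be tried against a local directory first. On ML Engine, and assuming a TensorFlow version that ships tensorflow.python.lib.io.file_io, you would pass file_io.list_directory and file_io.FileIO instead, as shown in the trailing comment.

```python
import os


def read_bucket_files(directory, list_fn=os.listdir, open_fn=open):
    """Return {filename: bytes} for every file directly under `directory`.

    list_fn stands in for os.listdir() and open_fn for open(); both default
    to the local-filesystem versions so the sketch runs without TensorFlow.
    """
    contents = {}
    for name in sorted(list_fn(directory)):
        # Assumption: a listing on Cloud Storage may report sub-directories
        # with a trailing slash; skip those, since only files are wanted.
        if name.endswith('/'):
            continue
        path = directory.rstrip('/') + '/' + name
        # 'rb' because the data in the question is binary.
        with open_fn(path, 'rb') as f:
            contents[name] = f.read()
    return contents


# On ML Engine (assumes TensorFlow is installed; the bucket path is the
# example one from the question):
#   from tensorflow.python.lib.io import file_io
#   data = read_bucket_files('gs://my-bucket/data',
#                            list_fn=file_io.list_directory,
#                            open_fn=file_io.FileIO)
```

This reads every file fully into memory, so it only suits data sets that fit in RAM.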

That said, you might want to consider using TensorFlow's built-in reading mechanisms, as they can be more performant.

Information on reading data can be found here. The latest and greatest (but not yet part of official "core" TensorFlow) is the Dataset API (more info here).

Some things to keep in mind:

  • Are you using a format TensorFlow can read? Can it be converted to that format?
  • Is the overhead of "feeding" high enough to affect training performance?
  • Is the training set too big to fit in memory?

If the answer is yes to one or more of these questions, especially the latter two, consider using readers.

For what it's worth, I also had problems reading files, in particular binary files, from Google Cloud Storage inside a Datalab notebook. The first way I managed to do it was by copying the files to my local filesystem with gsutil and then using TensorFlow to read them normally. That is shown in the local-read cell below, which runs after the file copy is done.

Here is my setup cell:

import math
import shutil
import numpy as np
import pandas as pd
import tensorflow as tf

tf.logging.set_verbosity(tf.logging.INFO)
pd.options.display.max_rows = 10
pd.options.display.float_format = '{:.1f}'.format

Here is a cell for reading the file locally as a sanity check:

# this works for reading local file
audio_binary_local = tf.read_file("100852.mp3")
waveform = tf.contrib.ffmpeg.decode_audio(audio_binary_local, file_format='mp3',
                                          samples_per_second=44100, channel_count=2)
# this will show that it has two channels of data
with tf.Session() as sess:
    result = sess.run(waveform)
    print(result)

Here the file is read directly from gs:// as a binary file:

# this works for remote files in gs:
gsfilename = 'gs://proj-getting-started/UrbanSound/data/air_conditioner/100852.mp3'
# python 2
#audio_binary_remote = tf.gfile.Open(gsfilename).read()
# python 3
audio_binary_remote = tf.gfile.Open(gsfilename, 'rb').read()
waveform = tf.contrib.ffmpeg.decode_audio(audio_binary_remote, file_format='mp3',
                                          samples_per_second=44100, channel_count=2)
# this will show that it has two channels of data
with tf.Session() as sess:
    result = sess.run(waveform)
    print(result)
