
Reading data from a bucket in Google ml-engine (tensorflow)

I am having issues reading data from a bucket hosted by Google. I have a bucket containing ~1000 files that I need to access, stored at (for example) gs://my-bucket/data

Using gsutil from the command line or one of Google's Python API clients, I can access the data in the bucket; however, importing these APIs is not supported by default on google-cloud-ml-engine.

I need a way to access both the data and the names of the files, either with a default Python library (i.e. os) or using tensorflow. I know tensorflow has this functionality built in somewhere, but it has been hard for me to find.

Ideally I am looking for two replacements: one for os.listdir() and another for open()

train_data = [read_training_data(filename) for filename in os.listdir('gs://my-bucket/data/')]

where read_training_data uses a tensorflow reader object.

Thanks for any help! (P.S. my data is binary)

If you just want to read data into memory, then this answer has the details you need, namely, to use the file_io module.
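As a minimal sketch of the two replacements asked for above (assuming a TF 1.x-era install, where file_io lives under tensorflow.python.lib.io, and using the example bucket path from the question):

from tensorflow.python.lib.io import file_io

# file_io.list_directory is the stand-in for os.listdir(); it accepts gs:// paths
filenames = file_io.list_directory('gs://my-bucket/data/')

# file_io.FileIO is the stand-in for open(); mode 'rb' since the data is binary
train_data = []
for filename in filenames:
    with file_io.FileIO('gs://my-bucket/data/' + filename, mode='rb') as f:
        train_data.append(f.read())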

That said, you might want to consider using TensorFlow's built-in reading mechanisms, as they can be more performant.

Information on reading can be found here. The latest and greatest (but not yet part of official "core" TensorFlow) is the Dataset API (more info here).

Some things to keep in mind:

  • Are you using a format TensorFlow can read? Can it be converted to that format?
  • Is the overhead of "feeding" high enough to affect training performance?
  • Is the training set too big to fit in memory?

If the answer is yes to one or more of these questions, especially the latter two, consider using readers (a rough Dataset sketch follows).
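For illustration, here is a minimal sketch of the Dataset route (assuming TF 1.4+, where tf.data is in core; the gs://my-bucket/data/* pattern is the example path from the question):

import tensorflow as tf

# enumerate the files in the bucket; list_files accepts gs:// glob patterns
filenames = tf.data.Dataset.list_files('gs://my-bucket/data/*')
# tf.read_file also understands gs:// paths, yielding each file as a raw byte string
dataset = filenames.map(tf.read_file)
iterator = dataset.make_one_shot_iterator()
next_blob = iterator.get_next()

with tf.Session() as sess:
    first_file_bytes = sess.run(next_blob)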

For what it's worth, I also had problems reading files, in particular binary files, from Google Cloud Storage inside a Datalab notebook. The first way I managed to do it was by copying files using gsutil to my local filesystem and using tensorflow to read the files normally. This is demonstrated below, after the file copy was done.
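The copy step itself is not shown in the cells below; as an illustration only, in a Datalab notebook it can be done with a shell escape (the path matches the example file used later):

# run gsutil from a notebook cell; "!" is the Datalab/IPython shell escape
!gsutil cp gs://proj-getting-started/UrbanSound/data/air_conditioner/100852.mp3 .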

Here is my setup cell:

import math
import shutil
import numpy as np
import pandas as pd
import tensorflow as tf

tf.logging.set_verbosity(tf.logging.INFO)
pd.options.display.max_rows = 10
pd.options.display.float_format = '{:.1f}'.format

Here is a cell for reading the file locally as a sanity check.

# this works for reading local file
audio_binary_local = tf.read_file("100852.mp3")
waveform = tf.contrib.ffmpeg.decode_audio(audio_binary_local, file_format='mp3',
                                          samples_per_second=44100, channel_count=2)
# this will show that it has two channels of data
with tf.Session() as sess:
    result = sess.run(waveform)
    print(result)

Here is how to read the file directly from gs:// as a binary file.

# this works for remote files in gs:
gsfilename = 'gs://proj-getting-started/UrbanSound/data/air_conditioner/100852.mp3'
# python 2
#audio_binary_remote = tf.gfile.Open(gsfilename).read()
# python 3
audio_binary_remote = tf.gfile.Open(gsfilename, 'rb').read()
waveform = tf.contrib.ffmpeg.decode_audio(audio_binary_remote, file_format='mp3',
                                          samples_per_second=44100, channel_count=2)
# this will show that it has two channels of data
with tf.Session() as sess:
    result = sess.run(waveform)
    print(result)
