
Reading a pandas pickle file in Tensorflow in CloudML

I'm getting an error trying to read a pandas pickle (created, for example, with the df.to_pickle() method) that is stored in Google Cloud Storage. I'm trying to do the following:

import pandas as pd
from tensorflow.python.lib.io import file_io

path_to_gcs_file = 'gs://xxxxx'
f = file_io.FileIO(path_to_gcs_file, mode='r').read()
train_df = pd.read_pickle(f)
f.close()

I get the following error:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 0: invalid start byte

Alternatively I tried:

from io import BytesIO

f = BytesIO(file_io.read_file_to_string(path_to_gcs_file, binary_mode=True))
train_df = pd.read_pickle(f)

This works locally but not on CloudML! (Presumably the pandas installed on CloudML is older, and older releases of read_pickle are stricter about accepting file-like objects instead of paths.)

f = file_io.read_file_to_string(path_to_gcs_file, binary_mode=True)
train_df = pd.read_pickle(f)

This gives me an error: AttributeError: 'bytes' object has no attribute 'seek'

pandas.read_pickle accepts a path as the first argument; you are passing a File object (file_io.FileIO) and a bytes object (the return value of read_file_to_string).
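(As a side note, the UnicodeDecodeError from the first attempt comes from mode='r', which opens the object in text mode and tries to decode it as UTF-8. A pickle file is binary data; 0x80 is the opcode that begins every protocol-2 and later pickle stream. It therefore has to be read in binary mode, as the later attempts do.)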

So far I have not found a way to read a pickle object directly from GCS using pandas, so you will have to copy it to the machine. You could use file_io.copy for that:

file_io.copy('gs://xxxx', '/tmp/x.pkl')
train_df = pd.read_pickle('/tmp/x.pkl')
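If you want to avoid the temporary file, one workaround (a minimal sketch, not from the original answers) is to bypass pandas' reader and unpickle the bytes in memory with the standard pickle module, which is essentially what read_pickle does once it has an open file. This assumes the pickle was written uncompressed, i.e. df.to_pickle() was called without a compression-implying suffix such as .gz:

import pickle

from tensorflow.python.lib.io import file_io

# Read the raw bytes from GCS and unpickle them in memory;
# no local copy is needed.
raw = file_io.read_file_to_string(path_to_gcs_file, binary_mode=True)
train_df = pickle.loads(raw)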

You should be able to get away with using a context manager, but I think reading the file that way is what is causing the problem, so you should instead download the file through the API:

pip install --upgrade google-cloud-storage

Then

import pickle

from google.cloud import storage

# Initialise a client
storage_client = storage.Client("[Your project name here]")
# Create a bucket object for our bucket
bucket = storage_client.get_bucket(bucket_name)
# Create a blob object from the filepath
blob = bucket.blob("folder_one/foldertwo/filename.extension")
# Download the file to a local destination (not a gs:// URI)
local_path = "/tmp/filename.extension"
blob.download_to_filename(local_path)
with open(local_path, "rb") as f:
    train_df = pickle.load(f)
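Downloading to a local file first works because pickle.load then receives a real, seekable file object opened in binary mode, which is exactly what the bytes object in the earlier attempt could not provide.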

Much was taken from this answer: Downloading a file from google cloud storage inside a folder
