Download and read an HDF5 file from Amazon S3 with Boto3

I am quite new here. I will try to be clear.

I have created an HDF5 file with PyTables and filled it with data. Then I uploaded the file from the /tmp/ directory of my AWS cluster to an S3 bucket using this code:

  • s3_client.upload_file(local_file_key, aws_bucket_name, aws_file_key)

I then downloaded the same HDF5 file from S3 and stored it again in the /tmp/ directory of my AWS cluster using this code:

  • s3_client.download_file(aws_bucket_name, aws_file_key, another_local_file_key)

Up to this point, there is no issue. The problem appears when I try to read the file I downloaded back:

  • tables.open_file(another_local_file_key)

    File "H5F.c", line 604, in H5Fopen
      unable to open file
    File "H5Fint.c", line 1087, in H5F_open
      unable to read superblock
    File "H5Fsuper.c", line 277, in H5F_super_read
      file signature not found

    End of HDF5 error back trace

    Unable to open/create file '/tmp/from_aws_dataset.hdf5'

Then I ran some checks in the shell of my cluster.

[user@cluster_ip_address tmp$] file my_dataset.hdf5

returns

 my_dataset.hdf5: Hierarchical Data Format (version 5) data

But [user@cluster_ip_address tmp$] file from_aws_dataset.hdf5 returns

 from_aws_dataset.hdf5: data

And in my Python code,

tables.is_pytables_file('/tmp/from_aws_dataset.hdf5') returns None

boto3 version: '1.4.7', python version: 2.7, tables version: '3.4.2', h5py version: '2.7.1'

Could someone help me, please?

My first guess would be that the file was transferred in text mode. The HDF5 file signature was designed to detect that sort of munging.

Have you tried using boto3's upload_fileobj() method instead of upload_file()? It looks like the former is for binary files like HDF5. It's unclear from the boto docs if the latter implies text.

import boto3
s3 = boto3.client("s3")
with open("myfile.h5", "rb") as f:  # open in binary mode
    s3.upload_fileobj(f, "bucket-name", "key-name")

It also looks like you can specify binary transfers explicitly using the put() method, like so:

s3.Object('mybucket', 'myfile.h5').put(Body=open('/tmp/myfile.h5', 'rb'))

The HDF5 file signature is documented in the HDF5 File Format Specification, if you are interested. Just scroll down a little to the first field of the superblock, where it says 'Format Signature'.
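To check the signature by hand, something like the sketch below might help. It assumes the file has no user block, so the 8-byte signature sits at offset 0 (the format also allows it at offsets 512, 1024, 2048, and so on, which this does not check). The helper name `has_hdf5_signature` is made up for illustration and is not part of PyTables or h5py.

```python
# Hypothetical helper, not part of PyTables, h5py, or boto3.
# The 8-byte HDF5 format signature: \x89 H D F \r \n \x1a \n
HDF5_SIGNATURE = b"\x89HDF\r\n\x1a\n"


def has_hdf5_signature(path):
    """Return True if the file starts with the HDF5 format signature.

    Only offset 0 is checked, so a file with a user block would be
    reported as False even if it is a valid HDF5 file.
    """
    with open(path, "rb") as f:  # binary mode, so no newline translation
        return f.read(len(HDF5_SIGNATURE)) == HDF5_SIGNATURE
```

Running it on both /tmp/my_dataset.hdf5 and /tmp/from_aws_dataset.hdf5 should show whether the first bytes were altered somewhere along the way.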

Old post, but in the spirit of trying to close out questions...

Can you try to manually download the file via the AWS S3 console and read it directly in Python? If that fails, then I would guess that you are uploading the file incorrectly. If it works, can you try to download the file using this command:

import boto
from boto.s3.key import Key

conn = boto.connect_s3('<<YOUR KEY ID>>', '<<YOUR SECRET ACCESS KEY>>')  # Make connection
bucket = conn.get_bucket(THE_NAME_OF_YOUR_BUCKET)  # Get bucket object
k = Key(bucket, FILE_AND_PATH)  # Get Key object of file
k.get_contents_to_filename(LOCAL_PATH_TO_SAVE)  # Save the file locally; should preserve everything

Have a look at this, it is quite useful: https://techietweak.wordpress.com/2016/05/16/file-handling-in-aws-s3-with-python-boto-library/
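Whichever download route you try, it may help to confirm whether the bytes actually changed between the original and the copy that came back from S3. Here is a minimal sketch that uses only the standard library; `file_sha256` and `same_content` are names I made up for illustration, and the chunk size is an arbitrary choice.

```python
import hashlib


def file_sha256(path, chunk_size=1024 * 1024):
    """Return the SHA-256 hex digest of a file, read in binary chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # iter() with a sentinel stops when read() returns empty bytes (EOF)
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def same_content(path_a, path_b):
    """Return True if both files hold exactly the same bytes."""
    return file_sha256(path_a) == file_sha256(path_b)
```

If same_content('/tmp/my_dataset.hdf5', '/tmp/from_aws_dataset.hdf5') is False, the file was modified during upload or download; if it is True, the problem lies elsewhere.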

For me this worked:

import boto3
s3 = boto3.resource('s3', region_name)
bucket = s3.Bucket(bucket_name)
with open(hdf5_file, 'rb') as f:
    bucket.Object(key).put(Body=f)
