
Download and read an HDF5 file from Amazon S3 with Boto3

I am quite new here. I will try to be clear.

I have created an HDF5 file with PyTables and filled it with data. Then I uploaded the file from the /tmp/ directory of my AWS cluster to an S3 bucket using this code:

  • s3_client.upload_file(local_file_key, aws_bucket_name, aws_file_key)

I then downloaded the same HDF5 file from S3 and stored it again in the /tmp/ directory of my AWS cluster using this code:

  • s3_client.download_file(aws_bucket_name, aws_file_key, another_local_file_key)

Up to that point, there was no issue. The problem appears when I want to read the file I downloaded back.

  • tables.open_file(another_local_file_key)

 File "H5F.c", line 604, in H5Fopen
        unable to open file
      File "H5Fint.c", line 1087, in H5F_open
        unable to read superblock
      File "H5Fsuper.c", line 277, in H5F_super_read
        file signature not found

    End of HDF5 error back trace

    Unable to open/create file '/tmp/from_aws_dataset.hdf5'

Then I ran some checks in the shell of my cluster.

[user@cluster_ip_address tmp$] file my_dataset.hdf5

returns

 my_dataset.hdf5: Hierarchical Data Format (version 5) data

But

[user@cluster_ip_address tmp$] file from_aws_dataset.hdf5

returns

 from_aws_dataset.hdf5: data

And in my Python code,

tables.is_pytables_file('/tmp/from_aws_dataset.hdf5') returns None
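
(As an additional check, h5py, which is also installed per the versions listed below, can test the file signature directly; a minimal sketch of that verification against the two local copies:)

import h5py

# is_hdf5() checks whether the file carries a valid HDF5 signature.
print(h5py.is_hdf5('/tmp/my_dataset.hdf5'))        # True for the original file
print(h5py.is_hdf5('/tmp/from_aws_dataset.hdf5'))  # False for the corrupted copy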

boto3 version: '1.4.7', python version: 2.7, tables version: '3.4.2', h5py version: '2.7.1'

Could someone help me, please?

My first guess would be that the file was transferred in text mode. The HDF5 file signature was designed to detect that sort of munging.
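
One way to confirm that guess is to compare the first 8 bytes of the downloaded copy against the signature; a minimal sketch, assuming the damaged file is at /tmp/from_aws_dataset.hdf5:

# Every valid HDF5 file starts with the 8-byte signature \x89 H D F \r \n \x1a \n.
HDF5_SIGNATURE = b'\x89HDF\r\n\x1a\n'
with open('/tmp/from_aws_dataset.hdf5', 'rb') as f:
    print(f.read(8) == HDF5_SIGNATURE)  # False would confirm the corruption

The signature deliberately contains \r\n and \x1a, so any text-mode rewriting of line endings breaks the comparison immediately.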

Have you tried using boto3's upload_fileobj() method instead of upload_file()? It looks like the former is for binary files like HDF5. It's unclear from the boto3 docs whether the latter implies text.

import boto3
s3 = boto3.client('s3')
with open("myfile.h5", "rb") as f:
    s3.upload_fileobj(f, "bucket-name", "key-name")

It also looks like you can specify binary transfers explicitly using the put() method, like so:

s3 = boto3.resource('s3')  # put() lives on the resource API's Object
s3.Object('mybucket', 'myfile.h5').put(Body=open('/tmp/myfile.h5', 'rb'))
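
On the download side, the analogous binary-safe call would be download_fileobj(); a minimal sketch with placeholder bucket and key names:

import boto3

s3 = boto3.client('s3')
# Stream the object into a file handle opened in binary ('wb') mode.
with open('/tmp/from_aws_dataset.hdf5', 'wb') as f:
    s3.download_fileobj('bucket-name', 'key-name', f)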

The HDF5 file signature is documented here, if you are interested. Just scroll down a little to the first field of the superblock, where it says 'Format Signature'.

Old post but in the spirit of trying to close out questions...

Can you try manually downloading the file via the AWS S3 console and reading it directly in Python? If that fails, I would guess that you are uploading the file incorrectly. If it works, can you try downloading the file using this code:

import boto
from boto.s3.key import Key

conn = boto.connect_s3('<<YOUR KEY ID>>', '<<YOUR SECRET ACCESS KEY>>')  # Make connection
bucket = conn.get_bucket(THE_NAME_OF_YOUR_BUCKET)  # Get bucket object
k = Key(bucket, FILE_AND_PATH)  # Get Key object of file
k.get_contents_to_filename(LOCAL_PATH_TO_SAVE)  # Saves the file locally, preserving the bytes

Have a look at this, it is quite useful: https://techietweak.wordpress.com/2016/05/16/file-handling-in-aws-s3-with-python-boto-library/

For me this worked:

import boto3

s3 = boto3.resource('s3', region_name=region_name)  # region_name must be passed as a keyword
bucket = s3.Bucket(bucket_name)
with open(hdf5_file, 'rb') as f:  # binary mode preserves the HDF5 signature
    bucket.Object(key).put(Body=f)
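
The matching download with the same resource API is symmetric; a short sketch using the same placeholder names, where local_copy_path is a hypothetical destination path:

import boto3

s3 = boto3.resource('s3', region_name=region_name)
bucket = s3.Bucket(bucket_name)
# download_file writes raw bytes to disk, so the HDF5 signature survives the round trip.
bucket.download_file(key, local_copy_path)  # local_copy_path: hypothetical local destination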
