
Downloading a large archive from AWS Glacier using Boto

I am trying to download a large archive (~ 1 TB) from Glacier using the Python package, Boto. The current method that I am using looks like this:

import os
import boto.glacier
import boto
import time

ACCESS_KEY_ID = 'XXXXX'
SECRET_ACCESS_KEY = 'XXXXX'
VAULT_NAME = 'XXXXX'
ARCHIVE_ID = 'XXXXX'
OUTPUT = 'XXXXX'

# Connect to Glacier through boto's layer2 interface
layer2 = boto.connect_glacier(aws_access_key_id=ACCESS_KEY_ID,
                              aws_secret_access_key=SECRET_ACCESS_KEY)

gv = layer2.get_vault(VAULT_NAME)

# Initiate the archive-retrieval job, then poll until Glacier has
# prepared the archive (this typically takes several hours)
job = gv.retrieve_archive(ARCHIVE_ID)
job_id = job.id

while not job.completed:
    time.sleep(10)
    job = gv.get_job(job_id)

if job.completed:
    print "Downloading archive"
    job.download_to_file(OUTPUT)

The problem is that the job ID expires after 24 hours, which is not enough time to retrieve the entire archive. I will need to break the download into at least 4 pieces. How can I do this and write the output to a single file?

It seems that you can simply specify the chunk_size parameter when calling job.download_to_file, like so:

if job.completed:
    print "Downloading archive"
    job.download_to_file(OUTPUT, chunk_size=1024*1024)

However, if you can't download all of the chunks within the 24 hours, I don't think you can choose to download only the ones you missed using layer2.

First method

Using layer1, you can simply call the get_job_output method and specify the byte range you want to download.

It would look like this:

CHUNK = 1024 * 1024

# check_file_size returns the number of bytes already downloaded
# (0 if the output file does not exist yet), so the download resumes
# where the previous run stopped
file_size = check_file_size(OUTPUT)

if job.completed:
    print "Downloading archive"
    layer1 = layer2.layer1  # the layer1 connection underlying layer2
    with open(OUTPUT, 'ab') as output_file:
        i = 0
        while True:
            # HTTP byte ranges are inclusive, hence the -1 on the end offset
            byte_range = (file_size + CHUNK * i,
                          file_size + CHUNK * (i + 1) - 1)
            response = layer1.get_job_output(VAULT_NAME, job_id, byte_range)
            data = response.read()
            output_file.write(data)
            if len(data) < CHUNK:
                break
            i += 1

With this script, you can simply rerun it when it fails and continue downloading your archive where you left off.
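The check_file_size helper is not shown above; a minimal version might simply return the size of the partially downloaded file, or 0 if it does not exist yet:

import os

def check_file_size(path):
    # Number of bytes already downloaded to `path`, or 0 if the
    # file does not exist yet (i.e. this is the first run)
    if os.path.exists(path):
        return os.path.getsize(path)
    return 0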

Second method

By digging into the boto code, I found a "private" method on the Job class that you might also use: _download_byte_range. With this method you can still use layer2.

CHUNK = 1024 * 1024

# Size of the partially downloaded file (0 if starting from scratch)
file_size = check_file_size(OUTPUT)

if job.completed:
    print "Downloading archive"
    with open(OUTPUT, 'ab') as output_file:
        i = 0
        while True:
            response = job._download_byte_range(file_size + CHUNK * i,
                                                file_size + CHUNK * (i + 1))
            output_file.write(response)
            if len(response) < CHUNK:
                break
            i += 1
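Since _download_byte_range is a private method, it may change between boto versions. The Job class also has a public get_output method that accepts a byte_range argument; if your version of boto provides it, a sketch of the same resumable loop (assuming get_output returns a response object with a read() method, like layer1's get_job_output) would be:

CHUNK = 1024 * 1024
file_size = check_file_size(OUTPUT)

if job.completed:
    print "Downloading archive"
    with open(OUTPUT, 'ab') as output_file:
        i = 0
        while True:
            # Byte ranges are inclusive, hence the -1 on the end offset
            start = file_size + CHUNK * i
            end = start + CHUNK - 1
            data = job.get_output(byte_range=(start, end)).read()
            output_file.write(data)
            if len(data) < CHUNK:
                break
            i += 1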

You have to add the region_name argument to your boto.connect_glacier call, as follows:

layer2 = boto.connect_glacier(aws_access_key_id=ACCESS_KEY_ID,
                              aws_secret_access_key=SECRET_ACCESS_KEY,
                              region_name='your region name')
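Because a resumed download skips the per-chunk hash verification that download_to_file normally performs, it can be worth checking the finished file against the tree hash Glacier reports for the job. A sketch, assuming boto exposes the job's SHA256TreeHash as job.sha256_treehash:

import hashlib
import binascii

def glacier_tree_hash(filename, chunk_size=1024 * 1024):
    # Glacier's tree hash: SHA-256 of each 1 MiB chunk, then repeatedly
    # hash concatenated pairs of digests until a single digest remains
    hashes = []
    with open(filename, 'rb') as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            hashes.append(hashlib.sha256(chunk).digest())
    if not hashes:
        return hashlib.sha256(b'').hexdigest()
    while len(hashes) > 1:
        paired = []
        for j in range(0, len(hashes), 2):
            if j + 1 < len(hashes):
                paired.append(hashlib.sha256(hashes[j] + hashes[j + 1]).digest())
            else:
                paired.append(hashes[j])
        hashes = paired
    return binascii.hexlify(hashes[0])

if glacier_tree_hash(OUTPUT) != job.sha256_treehash:
    print "Tree hash mismatch - the archive was not downloaded correctly"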
