
Downloading a large archive from AWS Glacier using Boto

I am trying to download a large archive (~ 1 TB) from Glacier using the Python package, Boto. The current method that I am using looks like this:

import os
import boto.glacier
import boto
import time

ACCESS_KEY_ID = 'XXXXX'
SECRET_ACCESS_KEY = 'XXXXX'
VAULT_NAME = 'XXXXX'
ARCHIVE_ID = 'XXXXX'
OUTPUT = 'XXXXX'

layer2 = boto.connect_glacier(aws_access_key_id = ACCESS_KEY_ID,
                              aws_secret_access_key = SECRET_ACCESS_KEY)

gv = layer2.get_vault(VAULT_NAME)

job = gv.retrieve_archive(ARCHIVE_ID)
job_id = job.id

while not job.completed:
    time.sleep(10)
    job = gv.get_job(job_id)

if job.completed:
    print "Downloading archive"
    job.download_to_file(OUTPUT)

The problem is that the job ID expires after 24 hours, which is not enough time to retrieve the entire archive. I will need to break the download into at least 4 pieces. How can I do this and write the output to a single file?

It seems that you can simply specify the chunk_size parameter when calling job.download_to_file, like so:

if job.completed:
    print "Downloading archive"
    job.download_to_file(OUTPUT, chunk_size=1024*1024)

However, if you can't download all the chunks within the 24 hours, I don't think you can choose to download only the ones you missed using layer2.

First method

Using layer1 you can simply use the method get_job_output and specify the byte range you want to download.

It would look like this (the underlying Layer1 client is available as layer2.layer1 on the connection from the question):

CHUNK = 1024 * 1024
# Resume from whatever has already been written to disk
file_size = os.path.getsize(OUTPUT) if os.path.exists(OUTPUT) else 0

if job.completed:
    print "Downloading archive"
    # Append so a rerun continues where the previous run stopped
    with open(OUTPUT, 'ab') as output_file:
        i = 0
        while True:
            start = file_size + CHUNK * i
            end = start + CHUNK - 1  # byte ranges are inclusive
            response = layer2.layer1.get_job_output(VAULT_NAME, job_id,
                                                    (start, end))
            data = response.read()
            output_file.write(data)
            if len(data) < CHUNK:
                break
            i += 1

With this approach you should be able to rerun the script after a failure and continue downloading your archive from where you left off.
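For what it's worth, here is a minimal sketch of what the rerun side could look like, reusing the layer2 connection, vault and polling loop from the question (the only assumption is that you save job_id somewhere between runs):

# On a rerun, look the job up again by its saved id and wait until
# Glacier reports it as ready before resuming the download.
job = gv.get_job(job_id)

while not job.completed:
    time.sleep(10)
    job = gv.get_job(job_id)

# ...then run the byte-range loop above; it starts from the current
# size of OUTPUT, so bytes already on disk are not fetched again.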

Second method

By digging into the boto code I found a "private" method in the Job class that you might also use: _download_byte_range. With this method you can stay on layer2, but since the method is private you should check its exact signature in the boto version you have installed.

import socket

CHUNK = 1024 * 1024
# Resume from whatever has already been written to disk
file_size = os.path.getsize(OUTPUT) if os.path.exists(OUTPUT) else 0

if job.completed:
    print "Downloading archive"
    with open(OUTPUT, 'ab') as output_file:
        i = 0
        while True:
            start = file_size + CHUNK * i
            end = start + CHUNK - 1  # byte ranges are inclusive
            # Private boto 2.x helper: takes a (start, end) tuple plus the
            # exceptions to retry on, and returns (data, tree_hash)
            data, tree_hash = job._download_byte_range((start, end),
                                                       (socket.error,))
            output_file.write(data)
            if len(data) < CHUNK:
                break
            i += 1

You have to add region_name to your boto.connect_glacier call, like the following:

layer2 = boto.connect_glacier(aws_access_key_id = ACCESS_KEY_ID,
                              aws_secret_access_key = SECRET_ACCESS_KEY,
                              region_name = 'your region name')
