Boto3 multipart upload and MD5 checking

Is there a boto3 function that uploads a file to S3, verifies the MD5 checksum after the upload, and takes care of multipart uploads and other concurrency issues?

According to the documentation, upload_file takes care of multipart uploads and put_object can check the MD5 sum. Is there a way for me to do both without writing a long function of my own? The AWS CLI does this ( https://docs.aws.amazon.com/cli/latest/topic/s3-faq.html ), and it is built on the same botocore library as boto3, but I'm not sure about boto3 itself.

As far as I know, boto3 has no native way to do a multipart upload and then easily compare MD5 hashes. You can either use the AWS CLI or, if you want to stick with boto3 and multipart uploads, something like the code below (please note that this is a rough example, not production code):

import boto3
import hashlib

from botocore.exceptions import ClientError
from botocore.client import Config
from boto3.s3.transfer import TransferConfig


# Part size in bytes; the local ETag calculation must use the same value as the upload.
chunk_size = 8 * 1024 * 1024

# This function is a re-worked function taken from here: https://stackoverflow.com/questions/43794838/multipart-upload-to-s3-with-hash-verification 
# Credits to user: https://stackoverflow.com/users/518169/hyperknot
def calculate_s3_etag(file_path, chunk_size=chunk_size):
    chunk_md5s = []

    with open(file_path, 'rb') as fp:
        while True:
            data = fp.read(chunk_size)

            if not data:
                break
            
            chunk_md5s.append(hashlib.md5(data))
    
    num_hashes = len(chunk_md5s)

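    if not num_hashes:
        # Empty file: no chunks were read. Do whatever you want to do here.
        raise ValueError(f"{file_path} is empty")

    if num_hashes == 1:
        # A single chunk is assumed to have been uploaded in one request,
        # in which case the ETag is the plain MD5 of the content.
        return chunk_md5s[0].hexdigest()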

    digest_byte_string = b''.join(m.digest() for m in chunk_md5s)
    digests_md5 = hashlib.md5(digest_byte_string)

    return f"{digests_md5.hexdigest()}-{num_hashes}"


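def s3_md5sum(bucket_name, resource_name, client):
    try:
        # The ETag value is wrapped in double quotes; [1:-1] strips them.
        return client.head_object(
            Bucket=bucket_name,
            Key=resource_name
        )['ETag'][1:-1]
    except ClientError:
        # Do whatever you want to do here, then re-raise the original error.
        # (`raise ClientError` would fail: the constructor needs arguments.)
        raise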


bucket = "<INSERT_BUCKET_NAME>"
file = "<INSERT_FILE_NAME>"

aws_region = "<INSERT_REGION>"
aws_credentials = {
    "aws_access_key_id": "<INSERT_ACCESS_KEY>",
    "aws_secret_access_key": "<INSERT_SECRET_KEY>",
}

client = boto3.client(
    "s3", config=Config(region_name=aws_region), **aws_credentials
)
transfer_config = TransferConfig(multipart_chunksize=chunk_size)

client.upload_file(file, bucket, file, Config=transfer_config)

tag = calculate_s3_etag(file)
result = s3_md5sum(bucket, file, client)

assert tag == result

Explanation:

  • During a multipart upload, the file is split into a number of chunks, an MD5 hash is calculated for each of them, the raw digests are combined into a byte string, and a hash of this byte string is stored in the S3 object's ETag as something that looks like "<hash_string>-<num_chunks>".
  • What you want to do is essentially recreate the ETag locally and, after the upload, compare it with the one stored in S3.
  • To recreate it locally, split the file into the same chunks that were used during the upload (same chunk size, hence the same number of chunks), calculate their MD5 digests, concatenate them into a byte string, hash that byte string, and then produce a string in the format "<hash_string>-<num_chunks>".
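
The steps above can be tried on a small in-memory payload without touching S3. This is only a minimal sketch: expected_etag and its threshold parameter are hypothetical names, not part of boto3. It assumes that the transfer manager switches to multipart once the payload size reaches the multipart threshold (so a payload of exactly one part would still get a "-1" suffix), and that the object is not encrypted in a way that makes the ETag something other than an MD5 (for example SSE-KMS).

```python
import hashlib


def expected_etag(data: bytes, part_size: int, threshold: int) -> str:
    """Return the ETag S3 is expected to report for `data` (hypothetical helper)."""
    # Assumption: payloads smaller than the threshold are sent in a single
    # request, so the ETag is the plain MD5 hex digest of the content.
    if len(data) < threshold:
        return hashlib.md5(data).hexdigest()

    # Otherwise: MD5 each part, concatenate the raw (binary) digests,
    # MD5 that byte string, and append the number of parts.
    digests = [
        hashlib.md5(data[offset:offset + part_size]).digest()
        for offset in range(0, len(data), part_size)
    ]
    combined = hashlib.md5(b"".join(digests)).hexdigest()
    return f"{combined}-{len(digests)}"


payload = b"abcdefghij"  # 10 bytes, so part_size=4 gives parts of 4, 4 and 2 bytes
print(expected_etag(payload, part_size=4, threshold=8))   # multipart form, ends in "-3"
print(expected_etag(payload, part_size=4, threshold=64))  # plain MD5, no suffix
```

With real uploads, part_size and threshold would have to match the multipart_chunksize and multipart_threshold of the TransferConfig that was actually used; otherwise the locally computed value will not match the ETag even when the data is intact.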
