
python boto3 for AWS - S3 Bucket Sync optimization

Currently I am trying to compare two S3 buckets with the goal of deleting files.

Problem definition: BucketA, BucketB

The script looks for files (same key name) in BucketB that are not present in BucketA. Files that exist only in BucketB must be deleted. The buckets contain about 3-4 million files each.

Many thanks.

Kind regards, Alexander

My solution idea:

Filling the lists is quite slow. Is there any way to accelerate it?

# Filling the lists
# e.g. BucketA (same procedure for BucketB)
import boto3

s3_client = boto3.client("s3")
bucket_name = "BucketA"
ListA = []
paginator = s3_client.get_paginator("list_objects_v2")
response = paginator.paginate(Bucket=bucket_name, PaginationConfig={"PageSize": 1000})  # up to 1000 keys per page (the API maximum)
for page in response:
    for file in page.get("Contents", []):
        ListA.append(file["Key"])

# finding the deletion values
diff = list(set(ListB) - set(ListA))

# delete files from BucketB (in chunks, since delete_objects accepts max. 1000 objects at once)
for delete_list in group_elements(diff, 1000):
    delete_objects_from_bucket(delete_list)
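The `group_elements` helper referenced above is not shown in the question; a minimal sketch of what it presumably does (splitting the deletion list into batches of at most 1000 keys) could look like this:

```python
def group_elements(items, chunk_size):
    """Yield successive chunks of at most chunk_size items from a list."""
    for i in range(0, len(items), chunk_size):
        yield items[i:i + chunk_size]
```

Each yielded chunk can then be passed directly to a batched delete call.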

The ListBucket() API call only returns 1000 objects at a time, so listing buckets with 100,000+ objects is very slow and best avoided. You have 3-4 million objects, so definitely avoid listing them!

Instead, use Amazon S3 Inventory, which can provide a daily or weekly CSV file listing all objects in a bucket. Activate it for both buckets, and then use the provided CSV files as input for the comparison.
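The comparison itself then becomes plain set arithmetic on the two CSV listings. A sketch, assuming both inventories were configured with CSV output and that the object key is the second column (bucket, key, ...); the file paths are placeholders:

```python
import csv

def keys_from_inventory(path):
    """Return the set of object keys from an S3 Inventory CSV listing.
    Assumes the key is in the second column (bucket, key, ...)."""
    with open(path, newline="") as f:
        return {row[1] for row in csv.reader(f)}

def keys_only_in_b(inventory_a, inventory_b):
    """Keys present in BucketB's inventory but missing from BucketA's."""
    return keys_from_inventory(inventory_b) - keys_from_inventory(inventory_a)
```

Since only set membership is needed, this avoids paginating millions of live ListObjects calls entirely.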

You can also use the CSV file generated by Amazon S3 Inventory as a manifest file for Amazon S3 Batch Operations. So, your code could generate a file that lists only the objects that you would like deleted, then have S3 Batch Operations process those deletions.
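Generating that manifest is straightforward. A sketch, assuming the CSV manifest format Batch Operations accepts (`bucket,key` rows, with keys URL-encoded); the bucket name and output path are placeholders:

```python
import csv
from urllib.parse import quote

def write_batch_manifest(bucket, keys, path):
    """Write a CSV manifest of bucket,key rows for S3 Batch Operations.
    Object keys are URL-encoded, as the manifest format requires."""
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        for key in sorted(keys):
            writer.writerow([bucket, quote(key)])
```

The resulting file can be uploaded to S3 and referenced when creating the Batch Operations delete job.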
