
How to copy files from one bucket to another without getting an out-of-memory (OOM) error

I have successfully created code that compares the files in two buckets and copies only the files from the source bucket that aren't in the destination bucket. It works great on my sample folder with only 15 objects in it. When I try to run it on the folders containing my real data, I get out-of-memory errors and the process ends with a "Killed" message. I thought that since it was essentially doing this one object at a time, it shouldn't have this problem. The problem is probably that I am using paginate and saving all of that data, so it adds up to a lot of data. I've rewritten this so many times, and this solution is so simple and works great on a small amount of data. This is why I hate dumbing down my data, because reality is different. Do I have to go back to using list_objects_v2? How do I get through this? Thanks!

import boto3

s3 = boto3.resource('s3')
s3_client = boto3.client('s3')
paginator = s3_client.get_paginator("list_objects_v2")
bucket1='bucket1_name'
bucket2='bucket2_name'
pages1=paginator.paginate(Bucket=bucket1,Prefix='test')
pages2=paginator.paginate(Bucket=bucket2,Prefix='test')
bucket1_list=[]
for page1 in pages1:
    for obj1 in page1['Contents']:
        obj1_key=obj1['Key']
        if not obj1_key.endswith("/"):
            bucket1_list.append(obj1_key)
bucket2_list=[]
for page2 in pages2:
    for obj2 in page2['Contents']:
        obj2_key=obj2['Key']
        if not obj2_key.endswith("/"):
            bucket2_list.append(obj2_key)
# Compares what keys from bucket1 aren't in bucket2
# and states what needs to be done to bucket2 to make them equal.
# (diff() is a separate helper; its result isn't used below)
diff_bucket12=diff(bucket2_list,bucket1_list)

tocopy=[]
for i in bucket1_list:
    if i not in bucket2_list:
        tocopy.append(i)

# COPY
copy_to_bucket=s3.Bucket(bucket2)
copy_source=dict()
for i in tocopy:
    copy_source['Bucket']=bucket1
    copy_source['Key']=i
    response=copy_to_bucket.copy(copy_source,i)

You don't need to generate a complete list of all items to compare them.

Since the APIs that enumerate an AWS S3 bucket always return keys in lexicographical order, you can enumerate both buckets and compare each key in turn. If the keys differ, the one that sorts first is unique to its bucket, so handle that case, advance that bucket's iterator, and compare its next item with the current item in the other bucket.

For instance, this little script will compare two buckets and show the differences, including cases where the same key name appears in both buckets but the sizes differ:

#!/usr/bin/env python3

import boto3

def enumerate_bucket(s3, bucket):
    # Wrap a paginator so only one page is loaded at a time, and 
    # downstream users can call this to get all items in a bucket
    # without worrying about the pages
    paginator = s3.get_paginator('list_objects_v2')
    for page in paginator.paginate(Bucket=bucket):
        for cur in page.get("Contents", []):
            yield cur
    # Return one final empty object to signal the end to avoid
    # needing to look for the StopIteration exception
    yield None

def compare_buckets(s3, bucket_a, bucket_b):
    # Setup two enumerations, and grab the first object
    worker_a = enumerate_bucket(s3, bucket_a)
    worker_b = enumerate_bucket(s3, bucket_b)
    item_a = next(worker_a)
    item_b = next(worker_b)

    # Keep trying to compare items till we hit one end
    while item_a is not None and item_b is not None:
        if item_a['Key'] == item_b['Key']:
            # Same item, return an entry if they differ in size
            if item_a['Size'] != item_b['Size']:
                yield item_a['Key'], item_b['Key'], 'Changed'
            # Move to the next item for each bucket
            item_a = next(worker_a)
            item_b = next(worker_b)
        elif item_a['Key'] < item_b['Key']:
            # Bucket A has the item first in order, meaning
            # it's unique to this bucket
            yield item_a['Key'], None, 'Bucket A only'
            item_a = next(worker_a)
        else:
            # Bucket B has the item first in order, meaning
            # it's unique to this bucket
            yield None, item_b['Key'], 'Bucket B only'
            item_b = next(worker_b)

    # All done, return any remaining items in either bucket
    # that didn't finish already since they're unique.  Only
    # one of these two while loops will do anything, or 
    # neither if both buckets are at the end
    while item_a is not None:
        yield item_a['Key'], None, 'Bucket A only'
        item_a = next(worker_a)

    while item_b is not None:
        yield None, item_b['Key'], 'Bucket B only'
        item_b = next(worker_b)


# Now just loop through, show the different items, logic inside this
# for loop can trigger based off the values to handle unique items, or changed
# items
s3 = boto3.client('s3')
for key_a, key_b, status in compare_buckets(s3, 'example-bucket-1', 'example-bucket-2'):
    print(f"{key_a} -> {key_b} ({status})")
