How to copy files from one bucket to another without getting an out-of-memory (OOM) error
I have successfully created code that compares the files in two buckets and only copies the files from the source bucket that aren't in the destination bucket. It works great on my sample folder with only 15 objects in it. When I try to do it on the folders with my real data in them, I get out-of-memory errors: the process dies with a "Killed" message. I thought that since it was essentially doing this one object at a time, it shouldn't have this problem. lol, the problem is probably the fact that I am using paginate and saving that data, so lots of data. I've re-written this so many times, and this solution is so simple and works great on a small amount of data. This is why I hate dumbing down my data, because reality is different. Do I have to go back to using list_objects_v2? How do I get through this? Thanks!
import boto3

s3 = boto3.resource('s3')
s3_client = boto3.client('s3')
paginator = s3_client.get_paginator('list_objects_v2')

bucket1 = 'bucket1_name'
bucket2 = 'bucket2_name'

pages1 = paginator.paginate(Bucket=bucket1, Prefix='test')
pages2 = paginator.paginate(Bucket=bucket2, Prefix='test')

# Collect every non-folder key from each bucket
bucket1_list = []
for page1 in pages1:
    for obj1 in page1['Contents']:
        obj1_key = obj1['Key']
        if not obj1_key.endswith("/"):
            bucket1_list.append(obj1_key)

bucket2_list = []
for page2 in pages2:
    for obj2 in page2['Contents']:
        obj2_key = obj2['Key']
        if not obj2_key.endswith("/"):
            bucket2_list.append(obj2_key)

# Find which keys from bucket1 aren't in bucket2,
# i.e. what needs to be copied to bucket2 to make them equal
tocopy = []
for i in bucket1_list:
    if i not in bucket2_list:
        tocopy.append(i)

# COPY
copy_to_bucket = s3.Bucket(bucket2)
for i in tocopy:
    copy_source = {'Bucket': bucket1, 'Key': i}
    copy_to_bucket.copy(copy_source, i)
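(An aside, not part of the original post: independent of the memory issue, `if i not in bucket2_list` rescans the whole list for every key, which is quadratic across large buckets. Converting the destination list to a set once makes each membership test O(1). The key lists below are made-up stand-ins for the paginated results.)

```python
# Hypothetical key lists standing in for the paginated results
bucket1_list = ['data/a.csv', 'data/b.csv', 'data/c.csv']
bucket2_list = ['data/b.csv']

# One-time conversion to a set makes each "not in" check O(1)
bucket2_set = set(bucket2_list)
tocopy = [key for key in bucket1_list if key not in bucket2_set]
# tocopy == ['data/a.csv', 'data/c.csv']
```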
You don't need to generate a complete list of all items to compare them.
Since APIs that enumerate an AWS S3 bucket always do so in lexicographical order, you can enumerate both buckets and compare each key in turn. If the keys differ, the bucket whose key sorts first holds an object that is unique to it, so handle that case, then compare that bucket's next item against the current item in the other bucket.
For instance, this little script will compare two buckets and show the differences, including cases where the same key name appears in both buckets but with different sizes:
#!/usr/bin/env python3

import boto3

def enumerate_bucket(s3, bucket):
    # Wrap a paginator so only one page is loaded at a time, and
    # downstream users can call this to get all items in a bucket
    # without worrying about the pages
    paginator = s3.get_paginator('list_objects_v2')
    for page in paginator.paginate(Bucket=bucket):
        for cur in page.get("Contents", []):
            yield cur
    # Yield one final None to signal the end, so callers don't
    # need to look for the StopIteration exception
    yield None

def compare_buckets(s3, bucket_a, bucket_b):
    # Set up two enumerations, and grab the first object from each
    worker_a = enumerate_bucket(s3, bucket_a)
    worker_b = enumerate_bucket(s3, bucket_b)
    item_a = next(worker_a)
    item_b = next(worker_b)
    # Keep comparing items till we hit the end of either bucket
    while item_a is not None and item_b is not None:
        if item_a['Key'] == item_b['Key']:
            # Same key; yield an entry if the sizes differ
            if item_a['Size'] != item_b['Size']:
                yield item_a['Key'], item_b['Key'], 'Changed'
            # Move to the next item in each bucket
            item_a = next(worker_a)
            item_b = next(worker_b)
        elif item_a['Key'] < item_b['Key']:
            # Bucket A has the item first in order, meaning
            # it's unique to this bucket
            yield item_a['Key'], None, 'Bucket A only'
            item_a = next(worker_a)
        else:
            # Bucket B has the item first in order, meaning
            # it's unique to this bucket
            yield None, item_b['Key'], 'Bucket B only'
            item_b = next(worker_b)
    # All done; any items remaining in either bucket are unique.
    # Only one of these two while loops will do anything, or
    # neither if both buckets ended at the same time
    while item_a is not None:
        yield item_a['Key'], None, 'Bucket A only'
        item_a = next(worker_a)
    while item_b is not None:
        yield None, item_b['Key'], 'Bucket B only'
        item_b = next(worker_b)

# Now just loop through and show the differing items; logic inside
# this for loop can branch on the status to handle unique items or
# changed items
s3 = boto3.client('s3')
for key_a, key_b, status in compare_buckets(s3, 'example-bucket-1', 'example-bucket-2'):
    print(f"{key_a} -> {key_b} ({status})")
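To turn this back into the original copy task, only the 'Bucket A only' rows need copying. A minimal sketch (not tested against real buckets; the bucket names below are the ones from the question): the filtering step is pure Python, and the copy itself uses the same Bucket.copy call as the question's code.

```python
def keys_to_copy(diff_stream):
    # From the (key_a, key_b, status) tuples produced by
    # compare_buckets, keep keys that exist only in the source
    # bucket, skipping folder placeholder keys
    return [key_a for key_a, key_b, status in diff_stream
            if status == 'Bucket A only' and not key_a.endswith('/')]

# Against real buckets it would look something like this
# (requires compare_buckets from the script above):
# s3 = boto3.client('s3')
# dest = boto3.resource('s3').Bucket('bucket2_name')
# for key in keys_to_copy(compare_buckets(s3, 'bucket1_name', 'bucket2_name')):
#     dest.copy({'Bucket': 'bucket1_name', 'Key': key}, key)
```

Because the filter consumes the generator lazily, the whole pipeline still holds only one page of listings per bucket in memory at a time.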