
Parallel copy of buckets/keys from boto3 or boto API between 2 different accounts/connections

I want to copy keys between buckets across 2 different accounts using the boto3 API. In boto3, I executed the following code and the copy worked:

import boto3

source = boto3.client('s3')       # client for the source account
destination = boto3.client('s3')  # client for the destination account
obj = source.get_object(Bucket='bucket', Key='key')
destination.put_object(Bucket='bucket', Key='key', Body=obj['Body'].read())

Basically I am fetching the data with GET and pasting it with PUT into another account.

Along similar lines with the boto API, I have done the following:

from boto.s3.connection import S3Connection
from boto.s3.key import Key

source = S3Connection()  # connection for the source account
source_bucket = source.get_bucket('bucket')
source_key = Key(source_bucket, key_name)

destination = S3Connection()  # connection for the destination account
destination_bucket = destination.get_bucket('bucket')
dist_key = Key(destination_bucket, source_key.key)
dist_key.set_contents_from_string(source_key.get_contents_as_string())

The above code achieves the purpose of copying any type of data, but the speed is really slow: it takes around 15-20 seconds to copy 1GB, and I have to copy 100GB plus. I tried Python multithreading, where each thread performs one copy operation, but the performance was worse, taking 30 seconds to copy 1GB; I suspect the GIL might be the issue here. With multiprocessing I get the same result as a single process, i.e. 15-20 seconds for a 1GB file.

I am using a very high-end server with 48 cores and 128GB RAM, and the network speed in my environment is 10 Gbps. Most of the search results are about copying data between buckets in the same account, not across accounts. Can anyone please guide me here? Is my approach wrong? Does anyone have a better solution?

Yes, it is the wrong approach.

You shouldn't download the file. You are using AWS infrastructure, so you should make use of the efficient AWS backend calls to do the work. Your approach wastes resources.

boto3.client.copy will do the job better than this.
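For example, here is a minimal sketch of a server-side cross-account copy with client.copy; the profile names, bucket names, and key are placeholders, and it assumes the destination credentials have already been granted read access to the source bucket:

import boto3

# Sessions authenticated against each account; the profile names are placeholders
source_client = boto3.Session(profile_name='source-account').client('s3')
dest_client = boto3.Session(profile_name='dest-account').client('s3')

# copy() performs a managed (multipart if needed) CopyObject on the AWS backend,
# so the object bytes never travel through this machine
dest_client.copy(
    CopySource={'Bucket': 'source-bucket', 'Key': 'some/key'},
    Bucket='dest-bucket',
    Key='some/key',
    SourceClient=source_client,  # used for source-side calls such as HeadObject
)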

In addition, you didn't describe what you are trying to achieve (e.g. is this some sort of replication requirement?).

Because with a proper understanding of your own needs, it is possible that you don't even need a server to do the job: S3 bucket event triggers, Lambda, etc. can all execute the copy job without a server, as in the sketch below.
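A minimal sketch of that serverless route, assuming a Lambda function subscribed to the source bucket's ObjectCreated events ('dest-bucket' is a placeholder name):

import boto3

s3 = boto3.client('s3')

# Lambda handler invoked for each ObjectCreated event on the source bucket
def handler(event, context):
    for record in event['Records']:
        src_bucket = record['s3']['bucket']['name']
        key = record['s3']['object']['key']
        # Server-side copy into the (placeholder) destination bucket
        s3.copy({'Bucket': src_bucket, 'Key': key}, 'dest-bucket', key)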

To copy files between two different AWS accounts, you can check out this link: Copy S3 object between AWS account

Note:

S3 is a huge virtual object store for everyone; that's why bucket names MUST be globally unique. This also means the S3 "controller" can do a lot of fancy work similar to a file server, e.g. replicating, copying, and moving files in the backend, without involving network traffic to your machine.

As long as you set up the proper IAM permissions/policies for the destination bucket, objects can move across buckets without an additional server.
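As a sketch of the permission side (in practice the source bucket must also grant the copying principal read access; the account ID, role, and bucket names below are placeholders):

import json
import boto3

# Placeholder source-bucket policy granting a role in the other account read access
policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {"AWS": "arn:aws:iam::123456789012:role/copy-role"},
        "Action": ["s3:GetObject", "s3:ListBucket"],
        "Resource": [
            "arn:aws:s3:::source-bucket",
            "arn:aws:s3:::source-bucket/*",
        ],
    }],
}

s3 = boto3.client('s3')  # credentials of the account that owns source-bucket
s3.put_bucket_policy(Bucket='source-bucket', Policy=json.dumps(policy))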

This is almost similar to a file server. Users can copy files to each other without any "download/upload": one just creates a folder with write permission for everyone, and a file copy from another user is done entirely within the file server, at the fastest raw disk I/O performance. Using the backend S3 copy API, you don't need a powerful instance or a high-performance network.

Your method is similar to attempting an FTP download of a file from a user on the same file server, which creates unwanted network traffic.

You should check out the TransferManager in boto3. It will automatically handle the threading of multipart uploads in an efficient way. See the docs for more detail.

Basically, you just have to use the upload_file method and the TransferManager will take care of the rest.

import boto3

# Get the service client
s3 = boto3.client('s3')

# Upload tmp.txt to bucket-name at key-name
s3.upload_file("tmp.txt", "bucket-name", "key-name")
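If the defaults are too conservative for a 10 Gbps link, the transfer can be tuned with a TransferConfig; the part size and concurrency values below are illustrative, not recommendations:

from boto3.s3.transfer import TransferConfig

# Illustrative tuning: 64MB multipart chunks, up to 20 parts uploaded concurrently
config = TransferConfig(
    multipart_threshold=64 * 1024 * 1024,
    multipart_chunksize=64 * 1024 * 1024,
    max_concurrency=20,
)

s3.upload_file("tmp.txt", "bucket-name", "key-name", Config=config)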
