
Use of Spark to optimize S3 to S3 transfer

I am learning Spark/Scala and trying to experiment with the scenario below using Scala. Scenario: copy multiple files from one S3 bucket folder to another S3 bucket folder.

Things done so far:
1) Use the AWS S3 SDK and Scala:
- Create a list of files from the S3 source location.
- Iterate through the list, pass the source and target S3 locations from step 1, and use the S3 API copyObject to copy each of these files to the (configured) target location. This works.
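A minimal sketch of this sequential approach, assuming the AWS SDK for Java v1 from Scala; the bucket and prefix names are placeholders and pagination of the listing is ignored for brevity:

    import com.amazonaws.services.s3.AmazonS3ClientBuilder
    import com.amazonaws.services.s3.model.ListObjectsV2Request
    import scala.collection.JavaConverters._

    object S3SequentialCopy {
      def main(args: Array[String]): Unit = {
        val s3 = AmazonS3ClientBuilder.defaultClient()

        // Placeholder locations, not from the original post
        val srcBucket = "source-bucket"
        val srcPrefix = "input/"
        val dstBucket = "destination-bucket"
        val dstPrefix = "output/"

        // 1) List the source keys
        val req  = new ListObjectsV2Request().withBucketName(srcBucket).withPrefix(srcPrefix)
        val keys = s3.listObjectsV2(req).getObjectSummaries.asScala.map(_.getKey)

        // 2) Copy each object one at a time; copyObject is a server-side copy,
        //    so no object data flows through the client
        keys.foreach { key =>
          val dstKey = dstPrefix + key.stripPrefix(srcPrefix)
          s3.copyObject(srcBucket, key, dstBucket, dstKey)
        }
      }
    }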

However, I am trying to understand: if I have a large number of files inside multiple folders, is this the most efficient way of doing it, or can I use Spark to parallelise this copying of files?

The approach I am thinking of is:
1) Use the S3 SDK to get the source paths, similar to what's explained above.
2) Create an RDD for the files using sc.parallelize() - something along these lines (a rough sketch follows this list)?

    sc.parallelize(objs.getObjectSummaries.map(_.getKey).toList)
      .flatMap { key => Source.fromInputStream(s3.getObject(bucket, key).getObjectContent: InputStream).getLines }

3) Can I use sc.wholeTextFiles in some way to make this work? I am not sure how to achieve this as of now.
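A minimal sketch of the Spark-parallelised copy described in step 2, assuming the AWS SDK for Java v1 and placeholder bucket/prefix names; it distributes only the key list and issues server-side copyObject calls from the executors, rather than reading object contents:

    import com.amazonaws.services.s3.AmazonS3ClientBuilder
    import org.apache.spark.sql.SparkSession

    object S3ParallelCopy {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("s3-parallel-copy").getOrCreate()
        val sc    = spark.sparkContext

        // Placeholder locations, not from the original post
        val srcBucket = "source-bucket"
        val srcPrefix = "input/"
        val dstBucket = "destination-bucket"
        val dstPrefix = "output/"

        // Key list built on the driver with the S3 SDK (step 1); a stub is used here
        val keys: Seq[String] = Seq("input/a.csv", "input/b.csv")

        // Distribute the keys; the S3 client is created per partition because it is not serializable
        sc.parallelize(keys, numSlices = 32).foreachPartition { part =>
          val s3 = AmazonS3ClientBuilder.defaultClient()
          part.foreach { key =>
            val dstKey = dstPrefix + key.stripPrefix(srcPrefix)
            s3.copyObject(srcBucket, key, dstBucket, dstKey)
          }
        }

        spark.stop()
      }
    }

Since copyObject never moves the bytes through Spark, sc.wholeTextFiles (which reads file contents into RDDs) would not be needed for a pure copy.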

Can you please help me understand whether I am thinking in the right direction, and whether this approach is correct?

Thanks

I think AWS did not make it complicated, though.

We had the same problem; we transferred around 2 TB in close to 10 minutes.

If you want to transfer from one bucket to another, it is better to use the built-in functionality to transfer within S3 itself.

https://docs.aws.amazon.com/cli/latest/reference/s3/sync.html

AWS CLI Command Example:

aws s3 sync s3://sourcebucket s3://destinationbucket

If you want to do it programmatically, you can use any of the SDKs to invoke the same type of copy. I would avoid reinventing the wheel.
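As a rough sketch of the programmatic route, assuming the AWS SDK for Java v1 from Scala (the bucket and key names are placeholders), TransferManager handles multipart copies of large objects much like the CLI does:

    import com.amazonaws.services.s3.AmazonS3ClientBuilder
    import com.amazonaws.services.s3.transfer.TransferManagerBuilder

    object S3SdkCopy {
      def main(args: Array[String]): Unit = {
        val s3 = AmazonS3ClientBuilder.defaultClient()
        val tm = TransferManagerBuilder.standard().withS3Client(s3).build()

        // Placeholder bucket/key names
        val copy = tm.copy("sourcebucket", "some/key.csv", "destinationbucket", "some/key.csv")
        copy.waitForCompletion()   // blocks until the server-side copy completes

        tm.shutdownNow()
      }
    }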

Hope it helps.

I have a code snippet, cloudCp, which uses Spark for a high-performance parallelised upload; it'd be similar to do something for copy, where you'd drop down to the AWS lib for that operation.

But: you may not need to push the work out to many machines, as each of the PUT/x-copy-source calls may be slow, but it doesn't use any bandwidth. You could just start a process with many, many threads and a large HTTP client pool and run them all in that one process. Take the list, sort by the largest few first, then shuffle the rest at random to reduce throttling effects. Print out counters to help profile...
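A minimal single-process sketch of that idea, assuming the AWS SDK for Java v1 from Scala; the bucket names, thread count, and connection-pool size are placeholders, and listing pagination is ignored:

    import java.util.concurrent.Executors
    import scala.collection.JavaConverters._
    import scala.concurrent.{Await, ExecutionContext, Future}
    import scala.concurrent.duration.Duration
    import scala.util.Random
    import com.amazonaws.ClientConfiguration
    import com.amazonaws.services.s3.AmazonS3ClientBuilder

    object ThreadedS3Copy {
      def main(args: Array[String]): Unit = {
        // Placeholder locations
        val srcBucket = "source-bucket"
        val srcPrefix = "input/"
        val dstBucket = "destination-bucket"

        // One client with a large HTTP connection pool, shared by all threads
        val conf = new ClientConfiguration().withMaxConnections(64)
        val s3   = AmazonS3ClientBuilder.standard().withClientConfiguration(conf).build()

        val summaries = s3.listObjectsV2(srcBucket, srcPrefix).getObjectSummaries.asScala

        // Largest few first, the rest shuffled at random to reduce throttling effects
        val (big, rest) = summaries.sortBy(-_.getSize).splitAt(10)
        val ordered     = big ++ Random.shuffle(rest)

        val pool = Executors.newFixedThreadPool(64)
        implicit val ec: ExecutionContext = ExecutionContext.fromExecutorService(pool)

        // Each copy is a server-side x-copy-source call: slow per object, but no bandwidth used here
        val copies = ordered.map { o =>
          Future { s3.copyObject(srcBucket, o.getKey, dstBucket, o.getKey) }
        }
        Await.result(Future.sequence(copies), Duration.Inf)
        println(s"Copied ${copies.size} objects")
        pool.shutdown()
      }
    }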
