
Using Spark to optimize S3-to-S3 transfer

I am learning Spark/Scala and trying to experiment with the scenario below, using Scala. Scenario: copy multiple files from one S3 bucket folder to another S3 bucket folder.

Things done so far:
1) Use the AWS S3 SDK and Scala:
- Create a list of files from the S3 source location.
- Iterate through the list, passing the source and target S3 locations from the listing step, and use the S3 copyObject API to copy each of these files to the configured target location.
This works; a trimmed-down version is sketched below.
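A trimmed-down sketch of this SDK-only approach (bucket and prefix names here are placeholders, and listing pagination and error handling are elided):

import com.amazonaws.services.s3.AmazonS3ClientBuilder
import scala.collection.JavaConverters._

object S3SdkCopy {
  def main(args: Array[String]): Unit = {
    val srcBucket = "source-bucket"   // placeholder
    val srcPrefix = "input/"          // placeholder
    val dstBucket = "target-bucket"   // placeholder
    val dstPrefix = "output/"         // placeholder

    val s3 = AmazonS3ClientBuilder.defaultClient()

    // Step 1: list the source keys (listObjects returns at most 1000
    // keys per call; a full version would loop over the pagination)
    val keys = s3.listObjects(srcBucket, srcPrefix)
      .getObjectSummaries.asScala.map(_.getKey)

    // Step 2: server-side copy of each object; the object bytes never
    // pass through this client
    keys.foreach { key =>
      val dstKey = dstPrefix + key.stripPrefix(srcPrefix)
      s3.copyObject(srcBucket, key, dstBucket, dstKey)
    }
  }
}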

However, I am trying to understand: if I have a large number of files inside multiple folders, is this the most efficient way of doing it, or can I use Spark to parallelize the copying of these files?

The approach I am thinking of is:
1) Use the S3 SDK to get the source paths, similar to what's explained above.
2) Create an RDD over the file keys using sc.parallelize(), something along these lines (a copy-oriented sketch follows this list)?

sc.parallelize(objs.getObjectSummaries.asScala.map(_.getKey).toList)
  .flatMap { key =>
    Source.fromInputStream(s3.getObject(bucket, key).getObjectContent: InputStream).getLines
  }

3) Can I use sc.wholeTextFiles in some way to make this work? I am not sure how to achieve that yet.
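Roughly what I have in mind for the copy case, as an untested sketch (bucket names are placeholders; the S3 client is created inside each partition since the AmazonS3 client is not serializable):

import com.amazonaws.services.s3.AmazonS3ClientBuilder
import org.apache.spark.sql.SparkSession
import scala.collection.JavaConverters._

object SparkS3Copy {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("s3-copy").getOrCreate()
    val sc = spark.sparkContext

    val srcBucket = "source-bucket"  // placeholder
    val dstBucket = "target-bucket"  // placeholder

    // Driver side: list the keys to copy (pagination elided)
    val keys = AmazonS3ClientBuilder.defaultClient()
      .listObjects(srcBucket).getObjectSummaries.asScala.map(_.getKey).toList

    // Executor side: one client per partition; copyObject is a
    // server-side operation, so no object data flows through Spark
    sc.parallelize(keys, numSlices = 64).foreachPartition { part =>
      val client = AmazonS3ClientBuilder.defaultClient()
      part.foreach(key => client.copyObject(srcBucket, key, dstBucket, key))
    }

    spark.stop()
  }
}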

Can you please help me understand whether I am thinking in the right direction, and whether this approach is correct?

Thanks

I don't think AWS made this complicated, though.

We had the same problem; we transferred around 2 TB in close to 10 minutes.

If you want to transfer from one bucket to another, it is better to use the built-in functionality so the transfer happens within S3 itself.

https://docs.aws.amazon.com/cli/latest/reference/s3/sync.html

AWS CLI Command Example:

aws s3 sync s3://sourcebucket s3://destinationbucket

If you want to do it programmatically, you can use any of the SDKs to invoke the same type of operation, as sketched below. I would avoid reinventing the wheel.
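One SDK-level equivalent, sketched in Scala with the AWS SDK for Java v1 TransferManager (bucket and key names are placeholders; TransferManager also handles multipart copies of large objects for you):

import com.amazonaws.services.s3.transfer.TransferManagerBuilder

object S3ManagedCopy {
  def main(args: Array[String]): Unit = {
    val tm = TransferManagerBuilder.defaultTransferManager()

    // Starts an asynchronous server-side copy; waitForCompletion
    // blocks until it finishes
    val copy = tm.copy("sourcebucket", "path/to/file.csv",   // placeholders
                       "destinationbucket", "path/to/file.csv")
    copy.waitForCompletion()

    tm.shutdownNow()
  }
}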

Hope it helps.

I have a code snippet, cloudCp, which uses Spark for a high-performance parallelised upload; it would be similar to do something for copy, where you would drop down to the AWS library for that operation.

But: you may not need to push the work out to many machines, as each of the PUT/x-amz-copy-source calls may be slow, but they don't use any of your bandwidth. You could just start one process with many threads and a large HTTP client pool and run them all in that process. Take the list, sort it so the largest few objects come first, then shuffle the rest at random to reduce throttling effects. Print out counters to help profile. Something like the sketch below.
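A minimal single-process sketch of that idea (bucket names are assumptions; listing pagination, error handling, retries and counters all elided):

import com.amazonaws.ClientConfiguration
import com.amazonaws.services.s3.AmazonS3ClientBuilder
import java.util.concurrent.{Executors, TimeUnit}
import scala.collection.JavaConverters._
import scala.util.Random

object ThreadedS3Copy {
  def main(args: Array[String]): Unit = {
    val srcBucket = "source-bucket"  // placeholder
    val dstBucket = "target-bucket"  // placeholder
    val threads = 64

    // Large connection pool so every thread can have a copy in flight
    val s3 = AmazonS3ClientBuilder.standard()
      .withClientConfiguration(new ClientConfiguration().withMaxConnections(threads))
      .build()

    // Largest few objects first, the rest shuffled at random to reduce
    // throttling on hot key prefixes
    val summaries = s3.listObjects(srcBucket).getObjectSummaries.asScala.toList
    val (big, rest) = summaries.sortBy(-_.getSize).splitAt(threads)
    val ordered = big ++ Random.shuffle(rest)

    val pool = Executors.newFixedThreadPool(threads)
    ordered.foreach { o =>
      pool.submit(new Runnable {
        override def run(): Unit =
          s3.copyObject(srcBucket, o.getKey, dstBucket, o.getKey)
      })
    }
    pool.shutdown()
    pool.awaitTermination(Long.MaxValue, TimeUnit.NANOSECONDS)
  }
}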
