
Using Spark to optimize S3-to-S3 transfer

I am learning Spark/Scala and trying to experiment with the scenario below, using Scala. Scenario: copy multiple files from one S3 bucket folder to another S3 bucket folder.

Things done so far:
1) Use the AWS S3 SDK and Scala:
- Create a list of files from the S3 source location.
- Iterate through the list, passing the source and target S3 locations from the listing step, and use the S3 copyObject API to copy each of these files to the configured target location.
This works; a trimmed-down version is sketched below.
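A trimmed-down sketch of this SDK-only approach (bucket and prefix names here are placeholders, and listing pagination and error handling are elided):

import com.amazonaws.services.s3.AmazonS3ClientBuilder
import scala.collection.JavaConverters._

object S3SdkCopy {
  def main(args: Array[String]): Unit = {
    val srcBucket = "source-bucket"   // placeholder
    val srcPrefix = "input/"          // placeholder
    val dstBucket = "target-bucket"   // placeholder
    val dstPrefix = "output/"         // placeholder

    val s3 = AmazonS3ClientBuilder.defaultClient()

    // Step 1: list the source keys (listObjects returns at most 1000
    // keys per call; a full version would loop over the pagination)
    val keys = s3.listObjects(srcBucket, srcPrefix)
      .getObjectSummaries.asScala.map(_.getKey)

    // Step 2: server-side copy of each object; the object bytes never
    // pass through this client
    keys.foreach { key =>
      val dstKey = dstPrefix + key.stripPrefix(srcPrefix)
      s3.copyObject(srcBucket, key, dstBucket, dstKey)
    }
  }
}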

However, I am trying to understand: if I have a large number of files inside multiple folders, is this the most efficient way of doing it, or can I use Spark to parallelize the copying of these files?

The approach I am thinking of is:
1) Use the S3 SDK to get the source paths, similar to what's explained above.
2) Create an RDD over the file keys using sc.parallelize(), something along these lines (a copy-oriented sketch follows this list)?

sc.parallelize(objs.getObjectSummaries.asScala.map(_.getKey).toList)
  .flatMap { key =>
    Source.fromInputStream(s3.getObject(bucket, key).getObjectContent: InputStream).getLines
  }

3) Can I use sc.wholeTextFiles in some way to make this work? I am not sure how to achieve that yet.
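Roughly what I have in mind for the copy case, as an untested sketch (bucket names are placeholders; the S3 client is created inside each partition since the AmazonS3 client is not serializable):

import com.amazonaws.services.s3.AmazonS3ClientBuilder
import org.apache.spark.sql.SparkSession
import scala.collection.JavaConverters._

object SparkS3Copy {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("s3-copy").getOrCreate()
    val sc = spark.sparkContext

    val srcBucket = "source-bucket"  // placeholder
    val dstBucket = "target-bucket"  // placeholder

    // Driver side: list the keys to copy (pagination elided)
    val keys = AmazonS3ClientBuilder.defaultClient()
      .listObjects(srcBucket).getObjectSummaries.asScala.map(_.getKey).toList

    // Executor side: one client per partition; copyObject is a
    // server-side operation, so no object data flows through Spark
    sc.parallelize(keys, numSlices = 64).foreachPartition { part =>
      val client = AmazonS3ClientBuilder.defaultClient()
      part.foreach(key => client.copyObject(srcBucket, key, dstBucket, key))
    }

    spark.stop()
  }
}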

Can you please help me understand whether I am thinking in the right direction, and whether this approach is correct?

Thanks

I don't think AWS made this complicated, though.

We had the same problem; we transferred around 2 TB in close to 10 minutes.

If you want to transfer from one bucket to another, it is better to use the built-in functionality so the transfer happens within S3 itself.

https://docs.aws.amazon.com/cli/latest/reference/s3/sync.html

AWS CLI Command Example:

aws s3 sync s3://sourcebucket s3://destinationbucket

If you want to do it programmatically, you can use any of the SDKs to invoke the same type of operation, as sketched below. I would avoid reinventing the wheel.
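One SDK-level equivalent, sketched in Scala with the AWS SDK for Java v1 TransferManager (bucket and key names are placeholders; TransferManager also handles multipart copies of large objects for you):

import com.amazonaws.services.s3.transfer.TransferManagerBuilder

object S3ManagedCopy {
  def main(args: Array[String]): Unit = {
    val tm = TransferManagerBuilder.defaultTransferManager()

    // Starts an asynchronous server-side copy; waitForCompletion
    // blocks until it finishes
    val copy = tm.copy("sourcebucket", "path/to/file.csv",   // placeholders
                       "destinationbucket", "path/to/file.csv")
    copy.waitForCompletion()

    tm.shutdownNow()
  }
}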

Hope it helps.

I have a code snippet, cloudCp, which uses Spark for a high-performance parallelised upload; it would be similar to do something for copy, where you would drop down to the AWS library for that operation.

But: you may not need to push the work out to many machines, as each of the PUT/x-amz-copy-source calls may be slow, but they don't use any of your bandwidth. You could just start one process with many threads and a large HTTP client pool and run them all in that process. Take the list, sort it so the largest few objects come first, then shuffle the rest at random to reduce throttling effects. Print out counters to help profile. Something like the sketch below.
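A minimal single-process sketch of that idea (bucket names are assumptions; listing pagination, error handling, retries and counters all elided):

import com.amazonaws.ClientConfiguration
import com.amazonaws.services.s3.AmazonS3ClientBuilder
import java.util.concurrent.{Executors, TimeUnit}
import scala.collection.JavaConverters._
import scala.util.Random

object ThreadedS3Copy {
  def main(args: Array[String]): Unit = {
    val srcBucket = "source-bucket"  // placeholder
    val dstBucket = "target-bucket"  // placeholder
    val threads = 64

    // Large connection pool so every thread can have a copy in flight
    val s3 = AmazonS3ClientBuilder.standard()
      .withClientConfiguration(new ClientConfiguration().withMaxConnections(threads))
      .build()

    // Largest few objects first, the rest shuffled at random to reduce
    // throttling on hot key prefixes
    val summaries = s3.listObjects(srcBucket).getObjectSummaries.asScala.toList
    val (big, rest) = summaries.sortBy(-_.getSize).splitAt(threads)
    val ordered = big ++ Random.shuffle(rest)

    val pool = Executors.newFixedThreadPool(threads)
    ordered.foreach { o =>
      pool.submit(new Runnable {
        override def run(): Unit =
          s3.copyObject(srcBucket, o.getKey, dstBucket, o.getKey)
      })
    }
    pool.shutdown()
    pool.awaitTermination(Long.MaxValue, TimeUnit.NANOSECONDS)
  }
}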
