
Fastest way to copy S3 files without exact sync

I have an S3 bucket with many objects and I want to copy them to a different S3 bucket. It's not a straight sync because there are a couple of requirements:

  • I want to simplify the object key, so that /images/all/abcdef.png is copied to /images/abcdef.png (stripping out the /all)
  • Not all files will be copied across: the keys to copy are listed in a file (one key per line), since many old objects should not be copied.

Running this with the AWS CLI is extremely slow. I used the following script:

#!/bin/bash
# $keys is the path to the file of object keys (one per line)
while read -r key; do
  newkey=$(echo "$key" | sed 's/all\///g')   # strip "all/" from the key
  aws s3 cp "s3://oldbucket/images/$key" "s3://newbucket/images/$newkey"
done < "$keys"

It takes a second or two per file, so it would take many days to copy them all (over 1 million objects). Note that I'm running this from an external server, not an AWS machine, albeit one physically close to the region (Linode New Jersey to AWS us-east-1). The objects are images ranging from about 30 KB to 3 MB.

I've tried splitting the keys file and running the copies in parallel, but that doesn't seem to change the speed; I'm not sure why. I'm also unable to enable S3 Transfer Acceleration because the original bucket has a "." in its name (an S3 restriction). I'd like to know if there's a faster way to do this.

S3P is perhaps the fastest way to copy S3 files right now (2020). I've sustained speeds as high as 8 gigabytes/second.

(disclaimer: I wrote it.)

Arbitrary Key Rewriting

In addition to being fast, S3P is particularly well suited to your task: it lets you provide arbitrary key-rewriting rules written in JavaScript. For example, to remove the "/all/" from your keys, you could do the following:

npx s3p cp \
  --bucket my-bucket \
  --to-bucket my-to-bucket \
  --to-key "js:(key) => key.replace('/all/', '/')"

Why is S3P so fast?

Every tool I found is hindered by the fact that it lists S3 buckets serially: request 1000 items, wait, request the next 1000 items. I figured out a way to use the S3 API to parallelize listing, which significantly accelerates any S3 operation that involves listing a large number of files.
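This isn't exactly how S3P divides the key space, but the idea can be illustrated with the plain AWS CLI. The sketch below assumes object keys under images/all/ begin with a hex character (as abcdef.png suggests) and uses placeholder names; it lists sixteen prefix slices concurrently instead of paging through a single serial listing:

#!/bin/bash
bucket="oldbucket"                    # placeholder bucket name
for prefix in 0 1 2 3 4 5 6 7 8 9 a b c d e f; do
  # each slice is listed in its own background job;
  # Contents[].[Key] prints one key per line in text output
  # (a slice with no objects prints the literal string "None")
  aws s3api list-objects-v2 \
    --bucket "$bucket" \
    --prefix "images/all/$prefix" \
    --query 'Contents[].[Key]' \
    --output text > "keys-$prefix.txt" &
done
wait                                  # all sixteen listings run concurrently
cat keys-*.txt > all-keys.txt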

Easy to Try

You can try out s3p easily if you have Node.js installed. Just open a terminal and run the following to get a list of commands:

npx s3p 

Note: Though you can run this from your local machine, and it's still very fast, you'll get maximum performance from a decent-sized EC2 instance in the same region as your S3 buckets (eg m5.xlarge).

The aws s3 cp command uses some special code within the AWS CLI to determine where objects are being copied from and to. It then issues normal Amazon S3 API calls to copy the actual data:

  • If the source and destination are both S3 buckets, it uses CopyObject() to tell S3 to directly copy the object between buckets (without downloading/uploading)
  • If the source is the local computer and the destination is an S3 bucket, it uses PutObject()
  • If the source is an S3 bucket and the destination is the local computer, it uses GetObject()
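For example, the server-side copy your script triggers for each object can be issued directly with the low-level API (hypothetical key names, matching the question's example):

# server-side copy: no object data passes through the machine running the command
aws s3api copy-object \
  --copy-source oldbucket/images/all/abcdef.png \
  --bucket newbucket \
  --key images/abcdef.png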

The aws s3 sync command does similar (but first compares source/destination files).

Closer proximity to the Amazon S3 endpoints (eg running the commands from an Amazon EC2 instance in the same region) would minimise network overhead, possibly making the object copies more efficient.

Running commands in parallel definitely would make things go faster, since S3 can copy objects in parallel. I often open two terminal windows to an EC2 instance and issue commands in each window; they run independently of each other, so that should greatly speed things up. (That's not necessarily the case when objects are being uploaded or downloaded, since there are network throughput limits. But since your script is simply issuing Copy commands, that won't matter.)
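As a minimal sketch of that idea in a script (file names are placeholders, and split -n requires GNU split), you can split the key list into chunks and run one copy loop per chunk in the background:

#!/bin/bash
split -n l/8 keys.txt chunk-            # 8 chunks, lines kept intact
for f in chunk-*; do
  (
    while read -r key; do
      newkey=$(echo "$key" | sed 's/all\///g')
      aws s3 cp "s3://oldbucket/images/$key" "s3://newbucket/images/$newkey"
    done < "$f"
  ) &                                   # one background worker per chunk
done
wait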

Alternative: Use aws s3 mv

If you want to move objects (rather than just copy them), you could use aws s3 mv. It performs a CopyObject() and then a DeleteObject() on the original object.
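For example (hypothetical key names):

# copies the object to newbucket, then deletes the original from oldbucket
aws s3 mv s3://oldbucket/images/all/abcdef.png s3://newbucket/images/abcdef.png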

John's answer is very complete. I'll just add a code example that runs your task faster in parallel with several workers (using GNU parallel).

#!/bin/bash
# Generate one "aws s3 cp" command per key, then run them with GNU parallel.
while read -r key; do
  newkey=$(echo "$key" | sed 's/all\///g')
  echo aws s3 cp "s3://oldbucket/images/$key" "s3://newbucket/images/$newkey"
done < "$keys" > jobs.txt

workers=30
parallel -j "$workers" < jobs.txt
