
How can I backup or sync an Amazon S3 bucket?

I have critical data in an Amazon S3 bucket. I want to make a weekly backup of its contents to another cloud service, or even within S3 itself. The best option would be to sync my bucket to a new bucket in a different region, in case of data loss.

How can I do that?

I prefer to back up locally with sync, so that only changes are transferred. That is not the perfect backup solution, but you can implement periodic updates later as needed:

s3cmd sync --delete-removed s3://your-bucket-name/ /path/to/myfolder/

If you have never used s3cmd, install and configure it with:

pip install s3cmd
s3cmd --configure

There are also hosted S3 backup services for around $5/month, but I would also look at Amazon Glacier, which lets you store a single archive of up to about 40,000 GB if you use multipart upload.

http://docs.aws.amazon.com/amazonglacier/latest/dev/uploading-archive-mpu.html#qfacts

Remember that if your S3 account is compromised, you could lose all of your data, since the sync would mirror an empty folder or corrupted files. So you had better write a script that archives your backup several times, e.g. by detecting the start of the week.
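
If you go that route, here is a minimal sketch of such a rotation script; the paths, retention count, and weekly naming scheme are placeholders you would adapt to your setup. Run it from cron (or any scheduler) after the s3cmd sync step.

#!/usr/bin/env python3
# Sketch: keep a few rotating copies of the locally synced folder,
# starting a fresh snapshot at the beginning of each ISO week.
# SYNC_DIR, ARCHIVE_ROOT and KEEP are placeholders - adjust to your setup.
import datetime
import shutil
from pathlib import Path

SYNC_DIR = Path('/path/to/myfolder')      # folder that s3cmd sync writes to
ARCHIVE_ROOT = Path('/path/to/archives')  # where weekly snapshots are kept
KEEP = 3                                  # number of weekly snapshots to retain

def weekly_snapshot():
    ARCHIVE_ROOT.mkdir(parents=True, exist_ok=True)
    year, week, _ = datetime.date.today().isocalendar()
    snapshot = ARCHIVE_ROOT / ('backup-%d-W%02d' % (year, week))
    if not snapshot.exists():             # archive at most once per week
        shutil.copytree(SYNC_DIR, snapshot)
    # prune the oldest snapshots so that only KEEP remain
    snapshots = sorted(p for p in ARCHIVE_ROOT.glob('backup-*') if p.is_dir())
    for old in snapshots[:-KEEP]:
        shutil.rmtree(old)

if __name__ == '__main__':
    weekly_snapshot()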

Update 01/17/2016:

The Python-based AWS CLI is very mature now.

Please use: https://github.com/aws/aws-cli
Example: aws s3 sync s3://mybucket .

This script backs up an S3 bucket:

#!/usr/bin/env python
from boto.s3.connection import S3Connection
import re
import datetime
import sys
import time

def main():
    s3_ID = sys.argv[1]
    s3_key = sys.argv[2]
    src_bucket_name = sys.argv[3]
    num_backup_buckets = sys.argv[4]
    connection = S3Connection(s3_ID, s3_key)
    delete_oldest_backup_buckets(connection, num_backup_buckets)
    backup(connection, src_bucket_name)

def delete_oldest_backup_buckets(connection, num_backup_buckets):
    """Deletes the oldest backup buckets such that only the newest NUM_BACKUP_BUCKETS - 1 buckets remain."""
    buckets = connection.get_all_buckets() # returns a list of bucket objects
    num_buckets = len(buckets)

    backup_bucket_names = []
    for bucket in buckets:
        if re.search(r'backup-\d{4}-\d{2}-\d{2}', bucket.name):
            backup_bucket_names.append(bucket.name)

    backup_bucket_names.sort(key=lambda x: datetime.datetime.strptime(x[len('backup-'):17], '%Y-%m-%d').date())

    # The buckets are sorted latest to earliest, so we want to keep the last NUM_BACKUP_BUCKETS - 1
    delete = len(backup_bucket_names) - (int(num_backup_buckets) - 1)
    if delete <= 0:
        return

    for i in range(0, delete):
        print 'Deleting the backup bucket, ' + backup_bucket_names[i]
        connection.delete_bucket(backup_bucket_names[i])

def backup(connection, src_bucket_name):
    now = datetime.datetime.now()
    # the month and day must be zero-filled
    new_backup_bucket_name = 'backup-' + ('%04d-%02d-%02d' % (now.year, now.month, now.day))
    print "Creating new bucket " + new_backup_bucket_name
    new_backup_bucket = connection.create_bucket(new_backup_bucket_name)
    copy_bucket(src_bucket_name, new_backup_bucket_name, connection)


def copy_bucket(src_bucket_name, dst_bucket_name, connection, maximum_keys = 100):
    src_bucket = connection.get_bucket(src_bucket_name)
    dst_bucket = connection.get_bucket(dst_bucket_name)

    result_marker = ''
    while True:
        keys = src_bucket.get_all_keys(max_keys = maximum_keys, marker = result_marker)

        for k in keys:
            print 'Copying ' + k.key + ' from ' + src_bucket_name + ' to ' + dst_bucket_name

            t0 = time.clock()
            dst_bucket.copy_key(k.key, src_bucket_name, k.key)
            print time.clock() - t0, ' seconds'

        if len(keys) < maximum_keys:
            print 'Done backing up.'
            break

        result_marker = keys[maximum_keys - 1].key

if __name__ == '__main__':
    main()

I use this in a rake task (for a Rails app):

desc "Back up a file onto S3"
task :backup do
     S3ID = "AKIAJM3FAKEFAKENRWVQ"
     S3KEY = "0A5kuzV+F1pbaMjZxHQAZfakedeJd0dfakeNpry"
     SRCBUCKET = "primary-mzgd"
     NUM_BACKUP_BUCKETS = 2

     Dir.chdir("#{Rails.root}/lib/tasks")
     system "./do_backup.py #{S3ID} #{S3KEY} #{SRCBUCKET} #{NUM_BACKUP_BUCKETS}"
end

The AWS CLI supports this now.

aws s3 cp s3://first-bucket-name s3://second-bucket-name --recursive
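
If you would rather drive the same server-side copy from Python instead of the CLI, a rough boto3 sketch (with placeholder bucket names) looks like this:

#!/usr/bin/env python3
# Rough boto3 sketch of a server-side bucket-to-bucket copy, roughly what
# "aws s3 cp --recursive" does. Bucket names are placeholders; credentials
# are picked up from the usual AWS configuration.
import boto3

SRC_BUCKET = 'first-bucket-name'
DST_BUCKET = 'second-bucket-name'

s3 = boto3.client('s3')
paginator = s3.get_paginator('list_objects_v2')

for page in paginator.paginate(Bucket=SRC_BUCKET):
    for obj in page.get('Contents', []):
        key = obj['Key']
        print('Copying', key)
        # The copy happens inside S3; nothing is downloaded locally.
        # Objects larger than 5 GB would need a multipart copy instead.
        s3.copy_object(
            Bucket=DST_BUCKET,
            Key=key,
            CopySource={'Bucket': SRC_BUCKET, 'Key': key},
        )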

I've tried to do this in the past, and it's still annoyingly difficult, especially with large, multi-GB, many-millions-of-files buckets. The best solution I ever found was S3S3Mirror, which was made for exactly this purpose.

It's not as trivial as just flipping a switch, but it's still better than most other DIY solutions I've tried. It's multi-threaded and will copy the files much faster than similar single-threaded approaches.

One suggestion: Set it up on a separate EC2 instance, and once you run it, just shut that machine off but leave the AMI there. Then, when you need to re-run, fire the machine up again and you're all set. This is nowhere near as nice as a truly automated solution, but is manageable for monthly or weekly backups.

The best way would be to have the ability to sync my bucket with a new bucket in a different region in case of data loss.

As of 24 Mar 2015, this is possible using the Cross-Region Replication feature of S3.

One of the listed Use-case Scenarios is "compliance requirements", which seems to match your use-case of added protection of critical data against data loss:

Although, by default, Amazon S3 stores your data across multiple geographically distant Availability Zones, compliance requirements might dictate that you store data at even further distances. Cross-region replication allows you to replicate data between distant AWS regions to satisfy these compliance requirements.

See How to Set Up Cross-Region Replication for setup instructions.
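
If you prefer to script the setup instead of using the console, a rough boto3 sketch is below. The bucket names and IAM role ARN are placeholders; the destination bucket must already exist in the target region with versioning enabled, and the role must grant S3 permission to replicate on your behalf.

#!/usr/bin/env python3
# Sketch: enable Cross-Region Replication on an existing bucket with boto3.
# SRC_BUCKET, DST_BUCKET and ROLE_ARN are placeholders. The destination bucket
# must already exist in another region with versioning enabled.
import boto3

SRC_BUCKET = 'my-critical-bucket'
DST_BUCKET = 'my-critical-bucket-replica'
ROLE_ARN = 'arn:aws:iam::123456789012:role/s3-replication-role'

s3 = boto3.client('s3')

# Replication requires versioning on the source bucket as well.
s3.put_bucket_versioning(
    Bucket=SRC_BUCKET,
    VersioningConfiguration={'Status': 'Enabled'},
)

# Replicate every object (empty prefix) to the destination bucket.
s3.put_bucket_replication(
    Bucket=SRC_BUCKET,
    ReplicationConfiguration={
        'Role': ROLE_ARN,
        'Rules': [
            {
                'Prefix': '',
                'Status': 'Enabled',
                'Destination': {'Bucket': 'arn:aws:s3:::' + DST_BUCKET},
            }
        ],
    },
)

Note that replication only applies to objects created after it is enabled; existing objects still need a one-time copy, for example with one of the approaches above.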
