
Best way to move contents from one s3 object/folder to another within the same bucket?

I have a job that needs to transfer ~150GB from one folder into another. This runs once a day.

def copy_new_data_to_official_location(bucket_name):
    s3 = retrieve_aws_connection('s3')

    # list_objects returns at most 1,000 keys per call, so paginate
    # to cover every object under the prefix
    paginator = s3.get_paginator('list_objects_v2')
    pages = paginator.paginate(Bucket=bucket_name, Prefix='my/prefix/here')

    for page in pages:
        for item in page.get('Contents', []):
            print(item['Key'])
            copy_source = {
                'Bucket': bucket_name,
                'Key': item['Key']
            }

            # destination key: the third path segment of the source key
            original_key_name = item['Key'].split('/')[2]
            s3.copy(copy_source, bucket_name, original_key_name)

That's what I have so far. The process takes a while and, if I'm reading correctly, I'm paying transfer fees for moving data between objects.

Is there a better way?

Flow:

  1. Run large scale job on Spark to feed data in from folder_1 and external source
  2. Copy output to folder_2
  3. Delete all contents from folder_1
  4. Copy contents of folder_2 to folder_1

Repeat the above flow on a daily cadence.

Spark is a bit strange here: the output needs to go to folder_2 first, because pointing the job's output straight at folder_1 would wipe the data before the job even kicks off.
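For reference, steps 2-4 boil down to three bulk S3 operations. Here is a minimal boto3 sketch of that part of the flow; the prefixes spark-output/, folder_1/ and folder_2/ are placeholders for whatever layout you actually use, and error handling is omitted:

import boto3

s3 = boto3.client('s3')

def list_keys(bucket, prefix):
    # Yield every key under a prefix; list_objects_v2 returns at most
    # 1,000 keys per call, so paginate
    paginator = s3.get_paginator('list_objects_v2')
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get('Contents', []):
            yield obj['Key']

def run_daily_flow(bucket):
    # Step 2: copy the job output into folder_2 (placeholder prefixes)
    for key in list_keys(bucket, 'spark-output/'):
        s3.copy({'Bucket': bucket, 'Key': key}, bucket,
                'folder_2/' + key.split('/', 1)[1])
    # Step 3: delete all contents of folder_1
    # (delete_objects can batch up to 1,000 keys if this is too slow)
    for key in list_keys(bucket, 'folder_1/'):
        s3.delete_object(Bucket=bucket, Key=key)
    # Step 4: copy the contents of folder_2 back to folder_1
    for key in list_keys(bucket, 'folder_2/'):
        s3.copy({'Bucket': bucket, 'Key': key}, bucket,
                'folder_1/' + key.split('/', 1)[1])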

There are no Data Transfer fees if the source and destination buckets are in the same Region. Since you are simply copying within the same bucket, there would be no Data Transfer fee.

150 GB is not very much data, but it can take some time to copy if there are many objects. The overhead of initiating the copy can sometimes take more time than actually copying the data. When using the copy() command, all data is transferred within Amazon S3 -- nothing is copied down to the computer where the command is issued.

There are several ways you could make the process faster:

  • You could issue the copy() commands in parallel (see the parallel-copy sketch after this list). In fact, this is how the AWS Command-Line Interface (CLI) works when using aws s3 cp --recursive and aws s3 sync.

  • You could use the AWS CLI to copy the objects rather than writing your own program (a one-line sync example follows below).

  • Instead of copying objects once per day, you could configure replication within Amazon S3 so that objects are copied as soon as they are created. (Although I haven't tried this with the same source and destination bucket.)

  • If you need to be more selective about the objects to copy immediately, you could configure Amazon S3 to trigger an AWS Lambda function whenever a new object is created. The Lambda function could apply some business logic to determine whether to copy the object, and then issue the copy() command itself (a minimal handler sketch follows below).
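To illustrate the parallel approach, here is a sketch that fans the copy() calls out over a thread pool. boto3 clients are thread-safe; the bucket, destination prefix, and worker count below are placeholders to tune for your object sizes and API throttling:

from concurrent.futures import ThreadPoolExecutor, as_completed

import boto3

s3 = boto3.client('s3')

def copy_one(bucket, source_key, dest_key):
    # Server-side copy: the data never leaves Amazon S3
    s3.copy({'Bucket': bucket, 'Key': source_key}, bucket, dest_key)

def copy_in_parallel(bucket, keys, dest_prefix, workers=20):
    # Most of the per-object cost is request overhead, so keeping
    # many copies in flight at once cuts the total wall time
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [
            pool.submit(copy_one, bucket, key,
                        dest_prefix + key.rsplit('/', 1)[-1])
            for key in keys
        ]
        for future in as_completed(futures):
            future.result()  # surface any failed copy

With the CLI option, the whole step collapses to a single command along the lines of aws s3 sync s3://your-bucket/folder_2/ s3://your-bucket/folder_1/ (bucket and prefixes are placeholders), and the CLI parallelizes the copies for you.

For the Lambda option, here is a minimal handler sketch for an s3:ObjectCreated:* event; the .csv filter and the folder_1/ destination are placeholder business logic:

import urllib.parse

import boto3

s3 = boto3.client('s3')

def lambda_handler(event, context):
    for record in event['Records']:
        bucket = record['s3']['bucket']['name']
        # Object keys arrive URL-encoded in S3 event notifications
        key = urllib.parse.unquote_plus(record['s3']['object']['key'])

        if not key.endswith('.csv'):  # placeholder business logic
            continue

        # Copy into the official location (placeholder prefix); scope the
        # trigger to the source prefix so these copies don't re-trigger
        # the function
        dest_key = 'folder_1/' + key.rsplit('/', 1)[-1]
        s3.copy({'Bucket': bucket, 'Key': key}, bucket, dest_key)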
