
Amazon S3 - multipart upload vs split files-then-upload

I am currently trying to upload files from local storage to S3 using Python. I have extremely large files (over 10 GB), and when I went through some best practices for faster uploads, I came across multipart upload. If I understood correctly, multipart upload does the following (a minimal boto3 sketch follows the list):

  1. Split the file into a number of chunks.
  2. Upload each of these chunks to S3 (either serially or in parallel based on our code).
  3. Once the upload of all the chunks is complete, S3 takes care of assembling the individual chunks into a single final object/file.
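For context, here is a minimal boto3 sketch of those three steps (the bucket, key, and part size are placeholder values I made up):

import boto3

s3 = boto3.client("s3")
bucket, key = "my-bucket", "bigfile.bin"  # placeholder names
part_size = 100 * 1024 * 1024             # 100 MB per part (S3 minimum is 5 MB)

# 1. Start the multipart upload
mpu = s3.create_multipart_upload(Bucket=bucket, Key=key)

# 2. Upload each chunk (serially here; the calls can also run in parallel)
parts = []
with open("bigfile.bin", "rb") as f:
    part_number = 1
    while True:
        chunk = f.read(part_size)
        if not chunk:
            break
        resp = s3.upload_part(Bucket=bucket, Key=key, PartNumber=part_number,
                              UploadId=mpu["UploadId"], Body=chunk)
        parts.append({"ETag": resp["ETag"], "PartNumber": part_number})
        part_number += 1

# 3. Ask S3 to assemble the parts into a single final object
s3.complete_multipart_upload(Bucket=bucket, Key=key, UploadId=mpu["UploadId"],
                             MultipartUpload={"Parts": parts})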

So after all the chunks are uploaded, multipart upload assembles them into a single object. But I want to keep the individual parts as they are, or find another way to split the files and upload them using Python boto's put_object method. This is because I want the individual chunks/parts of the file to be read from S3 in parallel for further processing. Is there a way to do this, or should I stick with the traditional approach of splitting the file myself and uploading the pieces in parallel (for faster upload)? A sketch of what I mean follows.
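To make the alternative concrete, here is a rough sketch of the split-and-upload-separately idea with put_object; the chunk size, worker count, and part naming scheme are just illustrative choices:

import os
import boto3
from concurrent.futures import ThreadPoolExecutor

s3 = boto3.client("s3")
bucket = "my-bucket"            # placeholder
path = "bigfile.bin"
chunk_size = 512 * 1024 * 1024  # 512 MB per piece

def upload_chunk(index, offset, length):
    # Each worker reads its own slice of the file and uploads it as an
    # independent object, e.g. bigfile.bin.part-0003
    with open(path, "rb") as f:
        f.seek(offset)
        s3.put_object(Bucket=bucket,
                      Key=f"{path}.part-{index:04d}",
                      Body=f.read(length))

size = os.path.getsize(path)
with ThreadPoolExecutor(max_workers=8) as pool:
    for i, offset in enumerate(range(0, size, chunk_size)):
        pool.submit(upload_chunk, i, offset, min(chunk_size, size - offset))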

Thanks in advance.

We had the same problem and here is the approach we took.

Enable Transfer Acceleration for your bucket:

https://docs.aws.amazon.com/AmazonS3/latest/dev/transfer-acceleration.html
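For what it's worth, here is roughly what enabling acceleration and routing transfers through the accelerated endpoint looks like from boto3 (bucket and file names are placeholders):

import boto3
from botocore.config import Config

# One-time setup: turn acceleration on for the bucket
boto3.client("s3").put_bucket_accelerate_configuration(
    Bucket="my-bucket",
    AccelerateConfiguration={"Status": "Enabled"},
)

# Create a client that uses the accelerated endpoint for transfers
s3 = boto3.client("s3", config=Config(s3={"use_accelerate_endpoint": True}))
s3.upload_file("bigfile.bin", "my-bucket", "bigfile.bin")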

If your upload bandwidth is limited, there is no point in splitting the files.

If you have enormous upload bandwidth and a single accelerated endpoint is not consuming all of it, you can split the file and upload the parts in parallel with multipart upload.
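With boto3's transfer manager this amounts to a single upload_file call; a sketch, where the part size and concurrency are values you would tune against your bandwidth:

import boto3
from boto3.s3.transfer import TransferConfig

# Multipart upload with 100 MB parts and 8 part uploads in flight at once
config = TransferConfig(multipart_threshold=100 * 1024 * 1024,
                        multipart_chunksize=100 * 1024 * 1024,
                        max_concurrency=8)

boto3.client("s3").upload_file("bigfile.bin", "my-bucket", "bigfile.bin",
                               Config=config)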

Uploading a single S3 object/file with multipart upload:

Detailed instructions are covered in the following link.

https://aws.amazon.com/premiumsupport/knowledge-center/s3-multipart-upload-cli/

Create Multipart Upload:

aws s3api create-multipart-upload --bucket multirecv --key testfile --metadata md5=mvhFZXpr7J5u0ooXDoZ/4Q==

Upload File Parts:

aws s3api upload-part --bucket multirecv --key testfile --part-number 1 --body testfile.001 --upload-id sDCDOJiTUVGeKAk3Ob7qMynRKqe3ROcavPRwg92eA6JPD4ybIGRxJx9R0VbgkrnOVphZFK59KCYJAO1PXlrBSW7vcH7ANHZwTTf0ovqe6XPYHwsSp7eTRnXB1qjx40Tk --content-md5 Vuoo2L6aAmjr+4sRXUwf0w==

List Parts and Complete Upload:

aws s3api list-parts --bucket multirecv --key testfile --upload-id sDCDOJiTUVGeKAk3Ob7qMynRKqe3ROcavPRwg92eA6JPD4ybIGRxJx9R0VbgkrnOVphZFK59KCYJAO1PXlrBSW7vcH7ANHZwTTf0ovqe6XPYHwsSp7eTRnXB1qjx40Tk
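The list-parts output above gives each part's ETag. You then finish with complete-multipart-upload, where the mpustruct JSON file (as in the linked article) lists every PartNumber/ETag pair:

aws s3api complete-multipart-upload --multipart-upload file://mpustruct --bucket multirecv --key testfile --upload-id sDCDOJiTUVGeKAk3Ob7qMynRKqe3ROcavPRwg92eA6JPD4ybIGRxJx9R0VbgkrnOVphZFK59KCYJAO1PXlrBSW7vcH7ANHZwTTf0ovqe6XPYHwsSp7eTRnXB1qjx40Tk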

Hope it helps.

EDIT1

Partial Read from S3:

With S3 you don't need to read the full object. You can specify a start and end byte range for the object, so you don't need to maintain the splits in S3; you can keep it as a single object. The command below reads it partially.

One more benefit is that you can issue these ranged reads in parallel as well.

aws s3api get-object --bucket my_bucket --key object/location/file.txt --range bytes=1000-2000 file1.range-1000-2000.txt
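The same ranged read can be done from boto3 with get_object's Range parameter and fanned out across threads; a sketch (the part size and worker count are arbitrary examples):

import boto3
from concurrent.futures import ThreadPoolExecutor

s3 = boto3.client("s3")
bucket, key = "my_bucket", "object/location/file.txt"  # from the example above
part_size = 1000

def read_range(start, end):
    # Fetch only bytes [start, end] of the object (Range is inclusive)
    resp = s3.get_object(Bucket=bucket, Key=key, Range=f"bytes={start}-{end}")
    return resp["Body"].read()

size = s3.head_object(Bucket=bucket, Key=key)["ContentLength"]
ranges = [(off, min(off + part_size, size) - 1)
          for off in range(0, size, part_size)]
with ThreadPoolExecutor(max_workers=8) as pool:
    chunks = list(pool.map(lambda r: read_range(*r), ranges))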
