I am currently trying to upload files from local to S3 using Python. I have extremely large files (over 10 GB), and when I went through some best practices for faster uploads, I came across multipart upload. If I understood correctly, multipart upload splits the file into chunks and uploads them in parallel.
After all the chunks are uploaded, multipart upload obviously assembles everything into a single object. But I want to keep the individual parts as they are, or find another way to split the file and upload the pieces using boto's put_object method. This is because I want the individual chunks/parts of the file to be read in parallel from S3 for my further processing. Is there a way to do this, or should I stick to the traditional way of splitting the file myself and uploading the parts in parallel (for faster upload)?
Thanks in advance.
We had the same problem and here is the approach we took.
Enable Transfer Acceleration for your bucket.
https://docs.aws.amazon.com/AmazonS3/latest/dev/transfer-acceleration.html
If your upload bandwidth is limited, there is no point in splitting the files.
If you have enormous upload bandwidth and a single accelerated endpoint is not consuming all of it, you can split the files and upload them with multipart.
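If you end up splitting the file yourself (whether for parallel put_object uploads or for multipart parts), the chunk boundaries are just fixed-size byte ranges. Here is a minimal sketch; the helper name and part size are my own, not from the question:

```python
def part_ranges(total_size, part_size):
    """Return inclusive (start, end) byte ranges covering total_size bytes."""
    ranges = []
    start = 0
    while start < total_size:
        # The last part may be shorter than part_size.
        end = min(start + part_size, total_size) - 1
        ranges.append((start, end))
        start = end + 1
    return ranges

# Example: a 10 GB file in 100 MB parts.
# ranges = part_ranges(10 * 1024**3, 100 * 1024**2)
```

Each (start, end) pair can then be read with `file.seek(start)` and uploaded as its own part or object.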
Upload a single S3 Object/File with multiparts:
Detailed instructions are covered in the following link.
https://aws.amazon.com/premiumsupport/knowledge-center/s3-multipart-upload-cli/
Create Multipart Upload:
aws s3api create-multipart-upload --bucket multirecv --key testfile --metadata md5=mvhFZXpr7J5u0ooXDoZ/4Q==
Upload File Parts:
aws s3api upload-part --bucket multirecv --key testfile --part-number 1 --body testfile.001 --upload-id sDCDOJiTUVGeKAk3Ob7qMynRKqe3ROcavPRwg92eA6JPD4ybIGRxJx9R0VbgkrnOVphZFK59KCYJAO1PXlrBSW7vcH7ANHZwTTf0ovqe6XPYHwsSp7eTRnXB1qjx40Tk --content-md5 Vuoo2L6aAmjr+4sRXUwf0w==
List Parts / Complete Upload:
aws s3api list-parts --bucket multirecv --key testfile --upload-id sDCDOJiTUVGeKAk3Ob7qMynRKqe3ROcavPRwg92eA6JPD4ybIGRxJx9R0VbgkrnOVphZFK59KCYJAO1PXlrBSW7vcH7ANHZwTTf0ovqe6XPYHwsSp7eTRnXB1qjx40Tk
The upload is then finished with aws s3api complete-multipart-upload, passing the collected part numbers and ETags.
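Since the question asks for Python, the same three steps can be sketched with a boto3 S3 client. This is a simplified outline (bucket/key names are placeholders, and error handling such as aborting the upload on failure is omitted):

```python
# Sketch: multipart upload with a boto3-style S3 client.
# Usage assumption, not shown here:
#   import boto3
#   client = boto3.client("s3")

def multipart_upload(client, bucket, key, part_bodies):
    """Upload the byte chunks in part_bodies as one multipart S3 object."""
    upload_id = client.create_multipart_upload(Bucket=bucket, Key=key)["UploadId"]
    parts = []
    for number, body in enumerate(part_bodies, start=1):
        # Part numbers start at 1; S3 returns an ETag for each part.
        resp = client.upload_part(
            Bucket=bucket, Key=key, UploadId=upload_id,
            PartNumber=number, Body=body,
        )
        parts.append({"PartNumber": number, "ETag": resp["ETag"]})
    # Completing the upload assembles the parts into a single object.
    client.complete_multipart_upload(
        Bucket=bucket, Key=key, UploadId=upload_id,
        MultipartUpload={"Parts": parts},
    )
    return parts
```

Note that this still produces a single assembled object, which is why the ranged reads below (rather than separate part objects) are the usual answer to the "read chunks in parallel" requirement.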
Hope it helps.
EDIT1
Partial Read from S3:
With S3 you don't need to read the full object; you can specify a start and end byte range. So you don't need to maintain the splits in S3: you can keep it as a single object, and the command below will help you read it partially.
One more benefit: you can read the ranges in parallel as well.
aws s3api get-object --bucket my_bucket --key object/location/file.txt --range bytes=1000-2000 file1.range-1000-2000.txt