
Problem streaming large files with Boto3 to S3 on EC2

I am trying to stream a large file from HTTP directly to S3. I would rather not download the file and then upload it; I want to stream it directly. The source is a big file (60 GB) served from an HTTP server, and the destination is an S3 bucket.

I have tested in two environments. In my WSL environment, the script gets killed once memory hits 100%; setting max_concurrency to 2 doesn't really help. Why do I still run out of memory?

On the EC2 (micro) machine, which is where I want to run the code, the boto3 code doesn't even run or show any error. Maybe I need to increase the machine's memory from 1 GB to 2-3 GB? But I would still like to stay on the free tier...

Is there any way to stream such large files directly? When I stream small files, 1 GB or less, it works without a problem.

I think the problem is memory related: the code tries to read the whole HTTP file into memory and then upload it. Maybe the way to do it is to read it into memory in chunks and stream it in chunks? How do I do that? I am not a Python expert and have been working on this for days.


    # Requires: import boto3; from boto3.s3.transfer import TransferConfig
    def stream_to_s3(self, source_filename, remote_filename):
        self.log(f"====> Streaming {source_filename} to S3://{remote_filename}")

        s3 = boto3.resource('s3')
        bucket = s3.Bucket(self.params['UPLOAD_TO_S3']['S3_BUCKET'])
        destination = bucket.Object(remote_filename)

        # Stream the HTTP response instead of downloading the file first.
        with self.session.get(source_filename, stream=True) as response:
            GB = 1024 ** 3
            MB = 1024 * 1024
            max_threshold = 5 * GB
            # if int(response.headers['content-length']) > max_threshold:
            # Multipart only above 5 GB; 8 MB parts, 2 concurrent threads.
            TC = TransferConfig(multipart_threshold=max_threshold,
                                max_concurrency=2,
                                multipart_chunksize=8 * MB,
                                use_threads=True)
            try:
                # response.raw is a non-seekable file-like object.
                destination.upload_fileobj(response.raw, Config=TC)
            except Exception as e:
                self.log(f"====> Failure streaming file to S3://{remote_filename}. Reason: {e}")
                return 1
        self.log(f"====> Succeeded streaming file to S3://{remote_filename}")
        return 0

You can use the smart-open package to obtain file objects for input sources and output destinations. This should enable "efficient streaming of very large files".

import requests
from smart_open import open

# The S3 URI must include the bucket name as well as the key.
remote_uri = f's3://{bucket_name}/{remote_filename}'

with requests.get(source_url, stream=True) as response:
    with open(remote_uri, 'wb') as f_out:
        # Copy the body in chunks so the whole file is never held in memory.
        for chunk in response.iter_content(chunk_size=8 * 1024 * 1024):
            f_out.write(chunk)
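
If you would rather stay with boto3 alone, the same chunk-by-chunk idea can be expressed with the low-level multipart-upload client API. This is only a minimal sketch, not the poster's code: bucket_name, key and source_url are placeholders, and error handling (e.g. calling abort_multipart_upload on failure) is omitted.

import boto3
import requests

# Placeholder values; substitute your own bucket, key and source URL.
bucket_name = 'my-bucket'
key = 'big-file.bin'
source_url = 'http://example.com/big-file.bin'
part_size = 8 * 1024 * 1024   # every part except the last must be >= 5 MB

s3_client = boto3.client('s3')
mpu = s3_client.create_multipart_upload(Bucket=bucket_name, Key=key)
parts = []
part_number = 1
buffer = b''

with requests.get(source_url, stream=True) as response:
    response.raise_for_status()
    for chunk in response.iter_content(chunk_size=1024 * 1024):
        buffer += chunk
        # Only about one part's worth of data is ever held in memory.
        while len(buffer) >= part_size:
            body, buffer = buffer[:part_size], buffer[part_size:]
            part = s3_client.upload_part(Bucket=bucket_name, Key=key,
                                         PartNumber=part_number,
                                         UploadId=mpu['UploadId'], Body=body)
            parts.append({'ETag': part['ETag'], 'PartNumber': part_number})
            part_number += 1

if buffer:  # upload whatever is left as the final (possibly short) part
    part = s3_client.upload_part(Bucket=bucket_name, Key=key,
                                 PartNumber=part_number,
                                 UploadId=mpu['UploadId'], Body=buffer)
    parts.append({'ETag': part['ETag'], 'PartNumber': part_number})

# Ask S3 to assemble the uploaded parts into the final object.
s3_client.complete_multipart_upload(Bucket=bucket_name, Key=key,
                                    UploadId=mpu['UploadId'],
                                    MultipartUpload={'Parts': parts})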

