
Process large files from S3

I am trying to fetch a large file (>10 GB, stored as CSV) from S3 and send it as a CSV attachment in the response. I am doing it using the following procedure:

async getS3Object(params: any) {
    // `res` and `fileId` come from the enclosing request handler
    s3.getObject(params, function (err, data) {
        if (err) {
            console.log('Error Fetching File', err);
        } else {
            // Buffers the entire object in memory before responding
            const csv = data.Body.toString('utf-8');
            res.setHeader('Content-disposition', `attachment; filename=${fileId}.csv`);
            res.set('Content-Type', 'text/csv');
            res.status(200).send(csv);
        }
    });
}

This takes painfully long to process the file and send it as a CSV attachment. How can I make this faster?

You're dealing with a huge file; you could break it into chunks using ranged GET requests (the Range parameter of getObject; see also the docs, search for "calling the getObject operation"). If you need the whole file, you could split the work across workers, though at some point the limit will probably be your connection, and if you need to send the whole file as an attachment, that won't help much.
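A minimal sketch of that ranged-GET pattern, assuming the AWS SDK v2 from the question (the bucket/key arguments, the onChunk callback, and the 64 MB chunk size are all illustrative):

    const AWS = require('aws-sdk');
    const s3 = new AWS.S3();

    // Fetch an object in 64 MB slices using the Range parameter of getObject.
    // onChunk receives each slice in order, so nothing larger than one slice
    // is ever held in memory at once.
    async function processInChunks(bucket, key, onChunk) {
        const { ContentLength } = await s3
            .headObject({ Bucket: bucket, Key: key })
            .promise();
        const chunkSize = 64 * 1024 * 1024; // 64 MB, tune to your network
        for (let start = 0; start < ContentLength; start += chunkSize) {
            const end = Math.min(start + chunkSize - 1, ContentLength - 1);
            const { Body } = await s3
                .getObject({ Bucket: bucket, Key: key, Range: `bytes=${start}-${end}` })
                .promise();
            await onChunk(Body, start); // hand the slice to the caller
        }
    }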

A better solution would be to never buffer the whole file in the first place. You can do this by streaming it from S3 (see also this, and this), or by setting up a proxy in your server so the bucket/subdirectory appears to the client to be part of your app.
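Here is a rough sketch of the streaming approach, assuming AWS SDK v2 and an Express handler as in the question (the route and bucket name are placeholders). createReadStream() pipes the object from S3 straight into the response, so the multi-gigabyte body is never held in memory:

    const AWS = require('aws-sdk');
    const express = require('express');

    const s3 = new AWS.S3();
    const app = express();

    app.get('/files/:fileId', (req, res) => {
        const params = { Bucket: 'my-bucket', Key: `${req.params.fileId}.csv` };

        res.setHeader('Content-disposition', `attachment; filename=${req.params.fileId}.csv`);
        res.set('Content-Type', 'text/csv');

        s3.getObject(params)
            .createReadStream()
            .on('error', (err) => {
                console.log('Error Fetching File', err);
                if (!res.headersSent) res.sendStatus(500);
                else res.end();
            })
            .pipe(res); // pipe() handles backpressure for us
    });

Because pipe() propagates backpressure, a slow client naturally throttles the read from S3 instead of filling up your server's memory.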

If you run this on EC2, the network performance of EC2 instances varies based on the instance type and size. https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ec2-instance-network-bandwidth.html

A bottleneck can happen at multiple places:

  • Network (bandwidth and latency)
  • CPU
  • Memory
  • Local Storage

One can check each of these. CloudWatch Metrics is our friend here.
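For example, one way to pull such a metric with the SDK (the instance id and region are placeholders; the same call works for NetworkIn, NetworkOut, and so on):

    const AWS = require('aws-sdk');
    const cloudwatch = new AWS.CloudWatch({ region: 'us-east-1' });

    // Average CPU for one instance over the last hour, in 5-minute buckets
    cloudwatch.getMetricStatistics({
        Namespace: 'AWS/EC2',
        MetricName: 'CPUUtilization', // or NetworkIn / NetworkOut
        Dimensions: [{ Name: 'InstanceId', Value: 'i-0123456789abcdef0' }],
        StartTime: new Date(Date.now() - 60 * 60 * 1000),
        EndTime: new Date(),
        Period: 300,
        Statistics: ['Average'],
    }, (err, data) => {
        if (err) console.error(err);
        else console.log(data.Datapoints);
    });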

CPU is the easiest to see and to scale with a bigger instance size.

Memory is a bit harder to observe, but one should have enough memory to keep the document in memory, so the OS does not start swapping.

Local Storage - IO can be observed. If the business logic is just to parse a CSV file and output the result to, let's say, another S3 bucket, there is no need to save the file locally at all; if fast local scratch space is needed, EC2 instances with local storage can be used - https://aws.amazon.com/ec2/instance-types/ - Storage Optimized.

Network - EC2 instance size can be modified, or Network optimized instances can be used.

Network - the way that one connects to S3 matters. Usually, the best approach is to use an S3 VPC endpoint https://docs.aws.amazon.com/vpc/latest/privatelink/vpc-endpoints-s3.html . The gateway option is free to use. By adopting it, one eliminates the VPC NAT gateway/NAT instance limitations, and it's even more secure.
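Creating the gateway endpoint is a one-time setup; a sketch using the SDK, where the VPC id, route-table id, and region are placeholders:

    const AWS = require('aws-sdk');
    const ec2 = new AWS.EC2({ region: 'us-east-1' });

    // A gateway endpoint routes S3 traffic inside the VPC, bypassing the NAT
    ec2.createVpcEndpoint({
        VpcEndpointType: 'Gateway',
        VpcId: 'vpc-0123456789abcdef0',
        ServiceName: 'com.amazonaws.us-east-1.s3',
        RouteTableIds: ['rtb-0123456789abcdef0'],
    }, (err, data) => {
        if (err) console.error(err);
        else console.log('Created', data.VpcEndpoint.VpcEndpointId);
    });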

Network - Sometimes, the S3 bucket is in one region, and the compute is in another. S3 supports replication, so the data can be brought into the compute's region: https://docs.aws.amazon.com/AmazonS3/latest/userguide/replication.html
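As a hedged sketch, a replication rule can be applied with the SDK; the bucket names and IAM role ARN below are placeholders, and versioning must already be enabled on both buckets:

    const AWS = require('aws-sdk');
    const s3 = new AWS.S3();

    // Replicate everything from the source bucket into a bucket that lives
    // in the same region as the compute
    s3.putBucketReplication({
        Bucket: 'source-bucket',
        ReplicationConfiguration: {
            Role: 'arn:aws:iam::123456789012:role/s3-replication-role',
            Rules: [{
                Status: 'Enabled',
                Priority: 1,
                Filter: {}, // empty filter = replicate all objects
                DeleteMarkerReplication: { Status: 'Disabled' },
                Destination: { Bucket: 'arn:aws:s3:::replica-in-compute-region' },
            }],
        },
    }, (err) => {
        if (err) console.error(err);
        else console.log('Replication rule applied');
    });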

Maybe some type of APM monitoring and code instrumentation can show whether the code itself can also be optimized.

Thank you.
