
Stream and zip to S3 from AWS Lambda Node.JS

My goal is to create a large gzipped text file and put it into S3.

The file contents consist of blocks which I read in a loop from another source.

Because of the size of this file I cannot hold all the data in memory, so I need to somehow stream it directly to S3 and zip it at the same time.

I understand how to perform this trick with the regular fs in Node.JS, but I am confused about whether it is possible to do the same trick with S3 from AWS Lambda. I know that s3.putObject can consume a streamObject, but it seems to me that this stream would have to be finalized before I perform the putObject operation, which could cause the allowed memory to be exceeded.
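For reference, the "regular fs" trick referred to here boils down to piping a read stream through gzip into a write stream, so nothing is held in memory; a minimal sketch (file names are placeholders):

```js
const fs = require('fs');
const zlib = require('zlib');

fs.createReadStream('input.txt')              // blocks are read lazily
  .pipe(zlib.createGzip())                    // compressed on the fly
  .pipe(fs.createWriteStream('output.txt.gz')); // written out as it arrives
```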

You can stream files (>5 MB) into S3 buckets in chunks using the multipart upload functions in the Node.js aws-sdk.

This is not only useful for streaming large files into buckets, but also enables you to retry failed chunks (instead of a whole file) and parallelize the upload of individual chunks (with multiple upload lambdas, which could be useful in a serverless ETL setup, for example). The order in which they arrive is not important, as long as you track them and finalize the process once all have been uploaded.

To use the multipart upload, you should:

  1. initialize the process using createMultipartUpload and store the returned UploadId (you'll need it for the chunk uploads)
  2. implement a Transform stream that processes the data coming from the input stream
  3. implement a PassThrough stream that buffers the data in large enough chunks before using uploadPart to push them to S3 (under the UploadId returned in step 1)
  4. track the returned ETags and PartNumbers from the chunk uploads
  5. use the tracked ETags and PartNumbers to assemble/finalize the file on S3 using completeMultipartUpload

Here's the gist of it in a working code example which streams a file from iso.org, pipes it through gzip and into an S3 bucket. Don't forget to change the bucket name and make sure to run the lambda with 512 MB of memory on Node 6.10. You can use the code directly in the web GUI since there are no external dependencies.

NOTE: This is just a proof of concept that I put together for demonstration purposes. There is no retry logic for failed chunk uploads, and error handling is almost non-existent, which can literally cost you (e.g. abortMultipartUpload should be called upon cancelling the whole process to clean up the uploaded chunks, since they remain stored and invisible on S3 even though the final file was never assembled). The input stream is paused instead of queuing upload jobs and utilizing backpressure stream mechanisms, etc.
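The original gist is not reproduced here, but a condensed sketch of the described flow (download, gzip on the fly, buffer into >=5 MB parts, multipart upload) might look like the following. The bucket name, key, and source URL are placeholders, and unlike the original it assumes a current Node.js runtime (async/await and async iteration over streams, not Node 6.10) with the aws-sdk v2 that Lambda bundles; the compressed output is buffered with Buffer.concat rather than a dedicated PassThrough subclass for brevity:

```js
const https = require('https');
const zlib = require('zlib');
const AWS = require('aws-sdk');

const s3 = new AWS.S3();

// Placeholders -- the original example pulled a document from iso.org:
const SOURCE_URL = 'https://example.com/some-large-file.txt';
const Bucket = 'your-bucket-name';
const Key = 'some-large-file.txt.gz';
const PART_SIZE = 5 * 1024 * 1024; // S3's minimum part size (for all parts except the last)

exports.handler = async () => {
  // 1. initialize the multipart upload and keep the UploadId
  const { UploadId } = await s3.createMultipartUpload({ Bucket, Key }).promise();

  const parts = [];
  let partNumber = 1;
  let buffer = Buffer.alloc(0);

  // upload every full 5 MB part; on the final call also flush whatever remains
  const flushParts = async (isLast) => {
    while (buffer.length >= PART_SIZE || (isLast && buffer.length > 0)) {
      const Body = buffer.slice(0, PART_SIZE);
      buffer = buffer.slice(Body.length);
      const { ETag } = await s3
        .uploadPart({ Bucket, Key, UploadId, PartNumber: partNumber, Body })
        .promise();
      parts.push({ ETag, PartNumber: partNumber }); // 4. track ETag + PartNumber
      partNumber += 1;
    }
  };

  // 2./3. download, gzip on the fly and buffer the compressed output into parts
  const gzip = zlib.createGzip();
  https.get(SOURCE_URL, (res) => res.pipe(gzip));

  for await (const chunk of gzip) {
    buffer = Buffer.concat([buffer, chunk]);
    if (buffer.length >= PART_SIZE) await flushParts(false);
  }
  await flushParts(true);

  // 5. assemble the final object from the tracked parts
  return s3
    .completeMultipartUpload({ Bucket, Key, UploadId, MultipartUpload: { Parts: parts } })
    .promise();
};
```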
