
Which AWS service should I use to process a large text file?

I have a use case where I need to read a very large text file that can contain up to 1 million records. For each record, I have to perform some validation, transform it into a different JSON, and then push it to an SNS topic. I don't need to read the records sequentially, so I can use parallelism.

One option is to put the file in an S3 bucket and use a Lambda to process it, fanning the records out (asynchronously) to multiple Lambda functions that handle the transformation (and validation) and then push the result to SNS. The other option is to use a Kinesis stream and have multiple Lambdas do the same thing (Multiple Lambdas using Kinesis streams).

What would be the ideal way to do this?

  1. S3 -> Lambda -> Multiple Lambdas -> SNS (a rough sketch of this option follows the list)
  2. Kinesis -> Multiple Lambdas (or Lambda -> Multiple Lambdas -> SNS)
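For reference, here is a minimal sketch of option 1, assuming a hypothetical dispatcher Lambda (triggered by the S3 upload) that fans record batches out to a hypothetical worker function named `record-worker`, which validates, transforms, and publishes to SNS. The bucket, key, function name, topic ARN, and the `validate_and_transform` helper are all placeholders, not part of the original question.

```python
import json
import boto3

s3 = boto3.client("s3")
lambda_client = boto3.client("lambda")
sns = boto3.client("sns")

BATCH_SIZE = 1000                          # records handed to each worker invocation
WORKER_FUNCTION = "record-worker"          # hypothetical worker Lambda name
TOPIC_ARN = "arn:aws:sns:us-east-1:123456789012:records-topic"  # placeholder

def dispatcher_handler(event, context):
    """Triggered by the S3 upload; fans record batches out to worker Lambdas."""
    record = event["Records"][0]
    bucket = record["s3"]["bucket"]["name"]
    key = record["s3"]["object"]["key"]

    body = s3.get_object(Bucket=bucket, Key=key)["Body"]
    batch = []
    for raw_line in body.iter_lines():
        batch.append(raw_line.decode("utf-8"))
        if len(batch) == BATCH_SIZE:
            _dispatch(batch)
            batch = []
    if batch:
        _dispatch(batch)

def _dispatch(batch):
    # InvocationType="Event" makes the invocation asynchronous (fire-and-forget).
    lambda_client.invoke(
        FunctionName=WORKER_FUNCTION,
        InvocationType="Event",
        Payload=json.dumps({"records": batch}),
    )

def worker_handler(event, context):
    """Deployed as a separate function: validate, transform, publish to SNS."""
    for line in event["records"]:
        record = validate_and_transform(line)  # hypothetical helper
        if record is not None:
            sns.publish(TopicArn=TOPIC_ARN, Message=json.dumps(record))

def validate_and_transform(line):
    """Placeholder validation/transformation; replace with real logic."""
    line = line.strip()
    return {"payload": line} if line else None
```

Note that asynchronous Lambda invocations cap the request payload at 256 KB, so the batch size would need to be chosen accordingly (or the dispatcher could hand out byte ranges instead, as one of the answers below suggests).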

You might want to look into AWS Glue. This service can perform ETL on most of the things stored in S3, so it might save you the hassle of doing that yourself. Combined with S3 triggering a Lambda, this could be an interesting option.
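As a hedged illustration of what a Glue job might look like, here is a rough PySpark sketch that reads the raw file from S3, applies a per-record transform, and writes the result back to S3 as JSON (from where a Lambda could publish to SNS). The paths, field names, and transform logic are placeholders, not anything from the original answer.

```python
import sys
from awsglue.transforms import Map
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from awsglue.job import Job
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read the raw text file(s) from S3 as a dynamic frame.
records = glue_context.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={"paths": ["s3://my-bucket/incoming/"]},  # placeholder path
    format="csv",
    format_options={"separator": "|"},  # adjust to the actual record layout
)

def transform(rec):
    # Hypothetical per-record validation/transformation.
    rec["valid"] = bool(rec.get("col0"))
    return rec

transformed = Map.apply(frame=records, f=transform)

# Write the transformed records back to S3 as JSON for downstream processing.
glue_context.write_dynamic_frame.from_options(
    frame=transformed,
    connection_type="s3",
    connection_options={"path": "s3://my-bucket/transformed/"},
    format="json",
)
job.commit()
```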

Edit: If the file can be parsed with RegExs, perhaps try Athena? Athena is relatively cheap and can handle larger files without a hitch.
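To illustrate the Athena idea, a table can be defined over the S3 location with the Hive RegexSerDe and then queried with SQL. This is a minimal sketch, assuming one record per line captured by a regex with two groups; the table name, regex, bucket, and result location are placeholders.

```python
import boto3

athena = boto3.client("athena")

# Placeholder DDL: two fields captured by a regex, one record per line.
CREATE_TABLE = r"""
CREATE EXTERNAL TABLE IF NOT EXISTS raw_records (
  id STRING,
  payload STRING
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES ("input.regex" = "^(\\d+),(.*)$")
LOCATION 's3://my-bucket/incoming/'
"""

response = athena.start_query_execution(
    QueryString=CREATE_TABLE,
    ResultConfiguration={"OutputLocation": "s3://my-bucket/athena-results/"},
)
print(response["QueryExecutionId"])
```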

If the records have a predictable length, you could use Range requests to divide the file before you pass it on to Lambda, preventing long run times.
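A minimal sketch of the Range-request idea, assuming fixed-length records of `RECORD_SIZE` bytes so that byte offsets never split a record; the bucket, key, and sizes below are placeholders.

```python
import boto3

s3 = boto3.client("s3")

BUCKET = "my-bucket"             # placeholder
KEY = "incoming/records.txt"     # placeholder
RECORD_SIZE = 128                # assumed fixed record length in bytes
RECORDS_PER_CHUNK = 10_000

def total_chunks():
    """Work out how many byte-range chunks the object splits into."""
    size = s3.head_object(Bucket=BUCKET, Key=KEY)["ContentLength"]
    chunk_bytes = RECORDS_PER_CHUNK * RECORD_SIZE
    return (size + chunk_bytes - 1) // chunk_bytes

def read_chunk(chunk_index):
    """Fetch one byte-range chunk; each call could run in its own Lambda."""
    start = chunk_index * RECORDS_PER_CHUNK * RECORD_SIZE
    end = start + RECORDS_PER_CHUNK * RECORD_SIZE - 1  # Range is inclusive
    resp = s3.get_object(
        Bucket=BUCKET,
        Key=KEY,
        Range=f"bytes={start}-{end}",
    )
    return resp["Body"].read()
```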

Also, have you tried parsing and chunking the file with Lambda? 1 million records isn't THAT much, and simply splitting the lines and handing chunks off to a validation (or perhaps SNS) step shouldn't be an issue.
