
Which AWS service should I use to process a large text file?

I have a use case where I need to read a very large text file that can contain up to 1 million records. For each record, I have to perform some validation, transform it into a different JSON, and then push it to an SNS topic. I don't need to read the records sequentially, so I can use parallelism. One option is to put the file in an S3 bucket, then use a Lambda to process the file that fans out the records asynchronously to multiple Lambda functions, which take care of the validation and transformation and then push the results to SNS. The other option is to use a Kinesis stream with multiple Lambdas doing the same thing (multiple Lambdas consuming the Kinesis stream).

What should be the ideal way to do this?

  1. S3 -> Lambda -> Multiple Lambdas -> SNS
  2. Kinesis -> Multiple Lambdas (or Lambda -> Multiple Lambdas -> SNS)
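A minimal sketch of option 1, assuming Python with boto3. The worker function name, chunk size, and bucket/key are placeholders, and the event shape is the standard S3 notification; this is an illustration of the fan-out idea, not a tested design:

```python
import json
import boto3

s3 = boto3.client("s3")
lambda_client = boto3.client("lambda")

CHUNK_SIZE = 1000                      # records per worker invocation (assumption)
WORKER_FUNCTION = "record-worker"      # hypothetical worker Lambda name


def handler(event, context):
    """Triggered by S3; splits the file into chunks and fans them out asynchronously."""
    s3_info = event["Records"][0]["s3"]
    obj = s3.get_object(Bucket=s3_info["bucket"]["name"], Key=s3_info["object"]["key"])
    lines = obj["Body"].read().decode("utf-8").splitlines()

    for start in range(0, len(lines), CHUNK_SIZE):
        chunk = lines[start:start + CHUNK_SIZE]
        # InvocationType="Event" makes the invoke asynchronous, so this loop
        # does not wait for each worker to finish.
        lambda_client.invoke(
            FunctionName=WORKER_FUNCTION,
            InvocationType="Event",
            Payload=json.dumps({"records": chunk}),
        )
```

Each worker Lambda would then validate and transform its chunk and publish to SNS; for very large files the orchestrator could stream the object instead of loading it fully into memory.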

You might want to look into AWS Glue. This service can perform ETL on most things stored in S3, so it might save you the hassle of doing that yourself. Combined with S3 triggering a Lambda, this might be an interesting option?
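As a rough illustration only, a Glue (PySpark) job could read the file from S3, apply a per-record transform, and write the results back out. Bucket paths and the transform are placeholders, and note Glue has no native SNS sink, so publishing would still need explicit boto3 calls:

```python
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.transforms import Map

glue_context = GlueContext(SparkContext.getOrCreate())

# Read the raw file straight from S3 (bucket/path are placeholders).
records = glue_context.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={"paths": ["s3://my-bucket/input/"]},
    format="csv",
    format_options={"withHeader": True},
)


def validate_and_transform(rec):
    # Per-record validation/transformation goes here; pushing to SNS from
    # inside the job would require boto3 calls.
    return rec


transformed = Map.apply(frame=records, f=validate_and_transform)

# Write the transformed records back to S3 as JSON (placeholder path).
glue_context.write_dynamic_frame.from_options(
    frame=transformed,
    connection_type="s3",
    connection_options={"path": "s3://my-bucket/output/"},
    format="json",
)
```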

Edit: If the file can be parsed with regexes, perhaps try Athena? Athena is relatively cheap and can handle larger files without a hitch.
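For example, you could define a table over the raw file with a regex SerDe and run queries against it via boto3. Database, table, bucket names, and the regex below are all placeholders, just to show the shape of the call:

```python
import boto3

athena = boto3.client("athena")

# Hypothetical table over the raw file using a regex SerDe.
ddl = """
CREATE EXTERNAL TABLE IF NOT EXISTS my_database.raw_records (
    field1 string,
    field2 string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES ('input.regex' = '([^,]*),(.*)')
LOCATION 's3://my-bucket/input/'
"""

response = athena.start_query_execution(
    QueryString=ddl,
    QueryExecutionContext={"Database": "my_database"},
    ResultConfiguration={"OutputLocation": "s3://my-bucket/athena-results/"},
)
print(response["QueryExecutionId"])  # poll get_query_execution() until it finishes
```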

If the records have a predictable length, you could use Range requests to divide the file before you pass it on to Lambda, preventing long run times.
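A Range request against S3 looks like this in boto3 (bucket, key, and byte range are placeholders); each worker would fetch a different, record-aligned slice:

```python
import boto3

s3 = boto3.client("s3")

# Fetch only the first ~1 MiB of the object; each worker gets a different range.
part = s3.get_object(
    Bucket="my-bucket",          # placeholder
    Key="input/records.txt",     # placeholder
    Range="bytes=0-1048575",
)
chunk = part["Body"].read()
# With fixed-length records the range can be aligned to record boundaries,
# so no record is split across two workers.
```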

Also, have you tried parsing and chunking the file with Lambda? 1 million records isn't THAT much, and simply line-splitting and handing chunks off to a validation step (or perhaps straight to SNS) shouldn't be an issue.
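A sketch of that single-Lambda approach, assuming Python with boto3; the bucket, key, topic ARN, and the trivial validation/transformation are placeholders:

```python
import json
import boto3

s3 = boto3.client("s3")
sns = boto3.client("sns")
TOPIC_ARN = "arn:aws:sns:eu-west-1:123456789012:records-topic"  # placeholder ARN


def handler(event, context):
    """Stream the file, line-split it, and publish transformed records in batches."""
    body = s3.get_object(Bucket="my-bucket", Key="input/records.txt")["Body"]
    batch = []
    for raw in body.iter_lines():
        line = raw.decode("utf-8").strip()
        if not line:                              # trivial "validation" placeholder
            continue
        message = json.dumps({"payload": line})   # placeholder transformation
        batch.append({"Id": str(len(batch)), "Message": message})
        if len(batch) == 10:                      # publish_batch takes at most 10 entries
            sns.publish_batch(TopicArn=TOPIC_ARN, PublishBatchRequestEntries=batch)
            batch = []
    if batch:
        sns.publish_batch(TopicArn=TOPIC_ARN, PublishBatchRequestEntries=batch)
```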
