
AWS: How to save streaming data to a database hosted on EC2 (e.g. MySQL/MongoDB)

We can easily save data between different AWS services, e.g. Kinesis to DynamoDB, or AWS IoT to Redshift, etc.

But what is the best strategy for saving streaming data to, say, MongoDB (which has no AWS PaaS offering; Atlas exists, but it has no integrations with other AWS services)?

I can see some third-party solutions out there, but what is the best strategy to implement on AWS itself? Is executing a Lambda function for each insert (with batching) the only option?

I am assuming that you are using Kinesis Firehose. If that's the case, what you can do is:

  • Configure Firehose to write to S3 every 5 minutes.

  • Firehose will create a new file on S3 every 5 minutes.

  • Trigger a Lambda function to read the new file from S3.

  • Write the data of the new file to MongoDB (a sketch follows this list).
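
For illustration, here is a minimal sketch of that Lambda step in Python, assuming Firehose delivers newline-delimited JSON and that `MONGO_URI` is a hypothetical environment variable pointing at the EC2-hosted MongoDB:

```python
import json
import os

import boto3
from pymongo import MongoClient  # pymongo must be bundled with the deployment package

s3 = boto3.client("s3")
# Create the client outside the handler so warm invocations reuse the connection.
collection = MongoClient(os.environ["MONGO_URI"])["streaming"]["events"]

def handler(event, context):
    # The S3 trigger delivers one record per newly created object.
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
        # Firehose concatenates records; assume one JSON document per line here.
        docs = [json.loads(line) for line in body.decode().splitlines() if line]
        if docs:
            collection.insert_many(docs)  # one bulk insert per file
```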

If you are using Kinesis (not Firehose), you can simply write a Kinesis consumer which reads data from the stream and writes it directly to MongoDB.
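
A minimal sketch of such a consumer, assuming a hypothetical single-shard stream named `my-stream`; a production consumer would instead use the KCL (or enhanced fan-out) and handle checkpointing and resharding:

```python
import json
import time

import boto3
from pymongoimport MongoClient
from pymongo import MongoClient

kinesis = boto3.client("kinesis")
collection = MongoClient("mongodb://ec2-host:27017")["streaming"]["events"]  # hypothetical host

shard_id = kinesis.describe_stream(StreamName="my-stream")["StreamDescription"]["Shards"][0]["ShardId"]
iterator = kinesis.get_shard_iterator(
    StreamName="my-stream", ShardId=shard_id, ShardIteratorType="LATEST"
)["ShardIterator"]

while True:
    resp = kinesis.get_records(ShardIterator=iterator, Limit=100)
    docs = [json.loads(r["Data"]) for r in resp["Records"]]
    if docs:
        collection.insert_many(docs)  # write each batch straight to MongoDB
    iterator = resp["NextShardIterator"]
    time.sleep(1)  # stay under the per-shard read limits
```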

FYI, there is DocumentDB, which exposes a MongoDB-like API; you can use that as an AWS-hosted MongoDB.

You can invoke a Lambda function on each Firehose invocation, and that Lambda can insert into MongoDB hosted on EC2. You can batch operations so as to reduce the number of Lambda invocations (and in return reduce cost).
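
As a sketch, a Firehose data-transformation Lambda can side-write each batch to MongoDB with a single `insert_many` call (same hypothetical `MONGO_URI` as above) and hand the records back to Firehose unchanged:

```python
import base64
import json
import os

from pymongo import MongoClient

collection = MongoClient(os.environ["MONGO_URI"])["streaming"]["events"]

def handler(event, context):
    # Firehose delivers records base64-encoded; decode the whole batch.
    docs = [json.loads(base64.b64decode(r["data"])) for r in event["records"]]
    collection.insert_many(docs)  # one bulk write per Firehose batch
    # Return the records unchanged so Firehose can continue delivery.
    return {
        "records": [
            {"recordId": r["recordId"], "result": "Ok", "data": r["data"]}
            for r in event["records"]
        ]
    }
```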

The solution depends mostly on your use case. How fast do you need to insert the data into your MongoDB?

If you need a near-real-time solution, then Kinesis and Lambda are your best option (assuming you don't want to invest in third-party products). If you can afford a delay and do batching, then you can save the Kinesis stream into S3 and then use AWS Glue to process and load your data into the database.
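
A sketch of that batch path as a Glue (PySpark) job, with the bucket name and MongoDB connection details as placeholders; it assumes Glue's MongoDB connection type is available in your Glue version:

```python
from awsglue.context import GlueContext
from pyspark.context import SparkContext

glue_context = GlueContext(SparkContext.getOrCreate())

# Read everything Firehose has delivered so far (assumes JSON records).
frame = glue_context.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={"paths": ["s3://my-firehose-bucket/"]},  # hypothetical bucket
    format="json",
)

# Load into the EC2-hosted MongoDB.
glue_context.write_dynamic_frame.from_options(
    frame=frame,
    connection_type="mongodb",
    connection_options={
        "uri": "mongodb://ec2-host:27017",  # hypothetical host
        "database": "streaming",
        "collection": "events",
        "username": "user",
        "password": "pass",
    },
)
```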

What you mostly need to think about is what you need to do with the data.

If you are collecting sensor data where you only care about aggregations (e.g. clicks in a UI), then it is better to store the raw data in S3 and then execute a data pipeline (using AWS Glue, for example) to store the aggregated data in MongoDB. S3 will be faster and cheaper for those types of data.
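
Building on the Glue sketch above, the aggregation step could be as small as a `groupBy` before the write, assuming a hypothetical `user_id` field in the events:

```python
from awsglue.dynamicframe import DynamicFrame

df = frame.toDF()  # DynamicFrame -> Spark DataFrame
clicks = df.groupBy("user_id").count()  # clicks per user instead of raw events
aggregated = DynamicFrame.fromDF(clicks, glue_context, "aggregated")
# Pass `aggregated` instead of `frame` to the write_dynamic_frame call above.
```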

If you are using the stream to pass business entities (e.g. documents that provide value on their own), then a near-real-time solution using AWS Lambda will be the better choice.

Without knowing the exact use case, I would propose storing only the data that provides value (e.g. reports on aggregated data) in your database, and using S3 with a lifecycle policy for the raw "sensor" data.
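
The lifecycle policy itself is a one-off setup; here is a sketch via boto3, with example day counts and a hypothetical bucket/prefix:

```python
import boto3

s3 = boto3.client("s3")
# Move raw data to Glacier after 30 days and expire it after a year.
s3.put_bucket_lifecycle_configuration(
    Bucket="my-firehose-bucket",  # hypothetical bucket
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "raw-sensor-data",
                "Filter": {"Prefix": "raw/"},
                "Status": "Enabled",
                "Transitions": [{"Days": 30, "StorageClass": "GLACIER"}],
                "Expiration": {"Days": 365},
            }
        ]
    },
)
```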
