
Populate DynamoDB table from Kinesis stream/firehose

Problem

What is the recommended way to populate a DynamoDB table with data coming from a Kinesis data source (stream or firehose)?

Current workflow

  • Data is ingested into Kinesis Firehose
  • A Lambda triggers on every record written to Kinesis Firehose and sends the data to DynamoDB

Why

I would like to get some advice on this because:

  • I am not sure whether this approach creates more work than necessary, i.e. I need to write and maintain code for the Lambda.
  • I see that I can configure the likes of Redshift or S3 as a consumer of my Kinesis data source. Why can't I do the same with DynamoDB? Is there a reason for this? Are other people not using this kind of workflow?

My opinion is that your workflow is currently more or less the right way to do it. The only thing I would change is to use Kinesis Streams instead of Firehose. You can then configure your stream as your Lambda event source, and there is an option to configure batch size. This greatly decreases your Lambda costs, because instead of one Lambda execution per record, you get one execution per batch (of, say, 500 records). Details are explained in the AWS documentation: https://docs.aws.amazon.com/lambda/latest/dg/with-kinesis.html
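As a rough sketch of that setup: with a Kinesis stream event source, the Lambda receives the whole batch in a single `event`, so you can decode every record and write them to DynamoDB together. The table name `my-table` and the record fields here are assumptions for illustration, not part of the original question:

```python
import base64
import json

def decode_records(event):
    """Parse the JSON payload of each Kinesis record in the batch.

    Kinesis delivers record data base64-encoded inside event["Records"].
    """
    items = []
    for record in event["Records"]:
        payload = base64.b64decode(record["kinesis"]["data"])
        items.append(json.loads(payload))
    return items

def handler(event, context):
    """Lambda entry point: one invocation per batch, up to the configured BatchSize."""
    import boto3  # imported here so decode_records stays testable without AWS
    table = boto3.resource("dynamodb").Table("my-table")  # hypothetical table name
    # batch_writer buffers puts into BatchWriteItem calls, reducing request count
    with table.batch_writer() as writer:
        for item in decode_records(event):
            writer.put_item(Item=item)
```

Compared with one invocation per record, this amortizes the Lambda invocation overhead across the whole batch.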

I am not exactly sure about the real reasons for not providing DynamoDB as a destination. My guess is that Kinesis doesn't know the structure of your content. The current Kinesis destinations either have some mechanism to structure incoming data for their needs, or they don't care about the object structure at all (S3). DynamoDB, on the other hand, requires some decisions from the user, and those architectural decisions are highly important for each table (performance, cost, partitioning, access patterns, etc.). Which field will be your partition key? Will you use a sort key? Will you format any of your fields? How will you make sure your primary key values are unique? What will be the type of each field (String, Number, etc.)? I think Lambda is the most suitable mechanism for making those decisions because of its flexibility.
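To make those decisions concrete, this is a minimal sketch of the kind of mapping a Lambda would hard-code before writing to DynamoDB. The field names (`device_id`, `ts`, `temperature`) and the choice of `pk`/`sk` as the key attributes are hypothetical:

```python
import json
from decimal import Decimal

def to_dynamodb_item(raw: bytes) -> dict:
    """Turn a raw record into a DynamoDB item, encoding the schema decisions:
    which field becomes the partition key, which the sort key, and the
    conversion of floats to Decimal (boto3 rejects Python floats)."""
    data = json.loads(raw, parse_float=Decimal)
    return {
        "pk": data["device_id"],  # hypothetical partition key choice
        "sk": data["ts"],         # hypothetical sort key choice
        "temperature": data["temperature"],
    }
```

None of these choices can be inferred from the stream itself, which is presumably why a generic Kinesis-to-DynamoDB destination is hard to offer.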

There are some automated mechanisms to infer a schema from the data itself (as AWS Glue does), but in the DynamoDB case it is not that simple.
