
Populate DynamoDB table from Kinesis stream/firehose

Problem

What is the recommended way to populate a DynamoDB table with data coming from a Kinesis data source (stream or firehose)?

Current workflow

  • Data is ingested into Kinesis Firehose
  • A Lambda function triggers on every record written to Kinesis Firehose and sends the data to DynamoDB
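The per-record step above can be sketched as a minimal Lambda handler. This assumes a Kinesis stream event source (records arrive base64-encoded under `Records[i].kinesis.data`) and a hypothetical table name; both are illustrations, not part of the original question.

```python
import base64
import json

TABLE_NAME = "events"  # hypothetical table name for illustration


def decode_record(record):
    """Kinesis delivers payloads base64-encoded; decode one record to a dict."""
    payload = base64.b64decode(record["kinesis"]["data"])
    return json.loads(payload)


def handler(event, context):
    # boto3 is preinstalled in the AWS Lambda runtime.
    import boto3

    table = boto3.resource("dynamodb").Table(TABLE_NAME)
    for record in event["Records"]:
        # One put_item per record; this is the cost problem the answer
        # below addresses with batching.
        table.put_item(Item=decode_record(record))
```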

Why

I would like to get some advice on this because

  • I am not sure whether this approach creates more work than necessary, i.e. I need to write and maintain code for the Lambda function
  • I see that I can configure the likes of Redshift or S3 as a consumer of my Kinesis data source. Why can't I do the same with DynamoDB? Is there a reason for this? Are other people not using this kind of workflow?

In my opinion, your workflow is more or less the right way to do it. The only thing I would change is to use Kinesis Streams instead of Firehose. You can then configure your stream as your Lambda event source, which has an option to configure the batch size. This will greatly decrease your Lambda costs, because instead of one Lambda execution per record, you will have one execution per batch (of, say, 500 records). Details are explained in the AWS documentation (https://docs.aws.amazon.com/lambda/latest/dg/with-kinesis.html).
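The batch-size setting lives on the event source mapping between the stream and the function. A sketch of creating one via boto3 follows; the stream ARN and function name are hypothetical placeholders.

```python
def enable_batching(stream_arn, function_name, batch_size=500):
    """Map a Kinesis stream to a Lambda function with batching enabled.

    With BatchSize=500, Lambda invokes the function once per batch of up
    to 500 records rather than once per record.
    """
    import boto3  # requires the AWS SDK and valid credentials

    client = boto3.client("lambda")
    return client.create_event_source_mapping(
        EventSourceArn=stream_arn,
        FunctionName=function_name,
        BatchSize=batch_size,        # records per invocation
        StartingPosition="LATEST",   # or "TRIM_HORIZON" to replay the stream
    )


# Example (hypothetical ARN and function name):
# enable_batching(
#     "arn:aws:kinesis:us-east-1:123456789012:stream/my-stream",
#     "my-dynamodb-writer",
# )
```

Inside the function, `Table.batch_writer()` can then flush the whole batch to DynamoDB in 25-item `BatchWriteItem` calls instead of one `PutItem` per record.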

I am not exactly sure about the real reasons for not providing DynamoDB as a destination. My guess is that Kinesis doesn't know the structure of your content. The current Kinesis destinations either have some mechanism to structure incoming data for their needs, or they don't care about the object structure at all (S3). DynamoDB, on the other hand, requires decisions from the user, and those architectural decisions are highly important for each table (performance, cost, partitioning, access patterns, etc.): which field will be your partition key? Will you use a sort key? Will you format any of your fields? How will you make sure your primary key values are unique? What will be the type of each field (String, Number, etc.)? I think Lambda is the most suitable mechanism for those decisions because of its flexibility.
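Those per-table decisions are exactly the kind of transformation a Lambda function would encode. A small sketch, under an assumed schema (device_id as partition key, timestamp as sort key; both names are hypothetical), including the boto3 requirement that numbers be `Decimal` rather than `float`:

```python
from decimal import Decimal


def to_dynamodb_item(event):
    """Apply the table-specific decisions Kinesis cannot make for us:
    choose the key fields and coerce types to what DynamoDB accepts."""
    # Hypothetical schema: partition key from device_id, sort key from timestamp.
    item = {
        "pk": event["device_id"],
        "sk": event["timestamp"],
    }
    for name, value in event.items():
        if name in ("device_id", "timestamp"):
            continue
        # boto3 rejects float for DynamoDB Number attributes; use Decimal.
        item[name] = Decimal(str(value)) if isinstance(value, float) else value
    return item
```

The handler would call this on each decoded record before `put_item`, keeping the schema logic in one testable place.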

There are automated mechanisms to infer a schema from the data itself (as AWS Glue does), but in DynamoDB's case it is not that simple.
