Kinesis Firehose delivers data from a DynamoDB Stream to S3: why do the files contain different numbers of JSON objects?
I'm new to AWS, and I'm working on archiving data from DynamoDB to S3. This is my solution, and I have already built the pipeline:
DynamoDB -> DynamoDB TTL + DynamoDB Stream -> Lambda -> Kinesis Firehose -> S3
But I found that the files in S3 contain different numbers of JSON objects: some files have 7 JSON objects, some have 6 or 4. I do the ETL in the Lambda, so S3 only stores REMOVE items, and the JSON has already been unmarshalled.
I expected each file to contain a single JSON object, since the TTL value is different for each item and the Lambda delivers each item as soon as it is deleted by TTL.
Is it because Kinesis Firehose batches the items (i.e. it waits for some time, collecting more items, before saving them to a file)? Or is there another reason? And could I estimate how many files it will save if DynamoDB has one item deleted by TTL every 5 minutes?
Thank you in advance.
Kinesis Firehose splits your data into files based on buffer size or buffer interval.
Let's say you have a buffer size of 1 MB and an interval of 1 minute. If less than 1 MB arrives within that 1-minute interval, Kinesis Firehose will still create a batch file out of whatever data it has received, even though it is less than 1 MB.
This is likely what is happening in your scenario, where little data is arriving. You can adjust the buffer size and interval to your needs, e.g. increase the interval to collect more items into a single batch.
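If you want to change those settings programmatically rather than in the console, a hedged sketch with boto3 might look like this (the stream name is a placeholder; the destination and version IDs must be read back from `describe_delivery_stream` first):

```python
# Sketch: raise the Firehose buffering hints so more items land in one file.
# "my-archive-stream" is an assumed name, not from the original post.
import boto3

firehose = boto3.client("firehose")

desc = firehose.describe_delivery_stream(
    DeliveryStreamName="my-archive-stream"
)["DeliveryStreamDescription"]

firehose.update_destination(
    DeliveryStreamName="my-archive-stream",
    CurrentDeliveryStreamVersionId=desc["VersionId"],
    DestinationId=desc["Destinations"][0]["DestinationId"],
    ExtendedS3DestinationUpdate={
        "BufferingHints": {
            "SizeInMBs": 128,          # flush when 128 MiB have accumulated...
            "IntervalInSeconds": 900,  # ...or after 15 minutes, whichever first
        }
    },
)
```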
You can choose a buffer size of 1–128 MiB and a buffer interval of 60–900 seconds. Whichever condition is satisfied first triggers data delivery to Amazon S3.
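Given those limits, you can roughly estimate the file count for your "one TTL deletion every 5 minutes" case. The sketch below assumes the items are far smaller than the size buffer (so only the interval matters) and that deletions arrive evenly spaced; note that in practice TTL deletions are not precisely timed, so items may cluster and the real count will be lower:

```python
# Back-of-the-envelope estimate: when the size buffer never fills,
# Firehose flushes at most once per buffer interval, and only when data
# has actually arrived in that window.
def expected_files_per_hour(items_per_hour, buffer_interval_s):
    flushes_per_hour = 3600 / buffer_interval_s
    return min(items_per_hour, flushes_per_hour)

# One TTL deletion every 5 minutes = 12 items/hour.
# 60 s interval: each item likely gets its own file, so ~12 files/hour.
# 900 s interval: at most 4 files/hour, i.e. ~3 items per file.
```

This also explains the uneven counts you observed: whichever items happen to fall inside the same buffer window end up in the same file, so 4, 6, or 7 objects per file is expected behaviour, not an error.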
From the AWS Kinesis Firehose docs: https://docs.aws.amazon.com/firehose/latest/dev/create-configure.html