
Kinesis Firehose delivers data from DynamoDB Stream to S3: Why is the number of JSON objects in the files different?

I'm new to AWS, and I'm working on archiving data from DynamoDB to S3. This is my solution, and I have built the pipeline:

DynamoDB -> DynamoDB TTL + DynamoDB Stream -> Lambda -> Kinesis Firehose -> S3

But I found that the files in S3 have different numbers of JSON objects. Some files have 7 JSON objects, some have 6 or 4. I have done the ETL in the Lambda: S3 only saves REMOVE items, and the JSON has been unmarshalled.
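Since the question doesn't include the Lambda code, here is a minimal sketch of that ETL step, assuming a Direct PUT delivery stream named "archive-stream" (hypothetical) and a DynamoDB stream view type that includes the old image:

```python
# Minimal sketch of the Lambda ETL step described above.
# Assumptions: a Firehose delivery stream named "archive-stream"
# (hypothetical) and a stream view type that includes OldImage.
import json
import boto3
from boto3.dynamodb.types import TypeDeserializer

firehose = boto3.client("firehose")
deserializer = TypeDeserializer()

def handler(event, context):
    for record in event["Records"]:
        # TTL deletions arrive on the stream as REMOVE events.
        if record["eventName"] != "REMOVE":
            continue
        old_image = record["dynamodb"].get("OldImage", {})
        # Unmarshall DynamoDB's typed JSON ({"S": ...}, {"N": ...}) into plain JSON.
        item = {k: deserializer.deserialize(v) for k, v in old_image.items()}
        # Newline-delimit the records so the objects in S3 are easy to parse.
        firehose.put_record(
            DeliveryStreamName="archive-stream",  # hypothetical name
            Record={"Data": (json.dumps(item, default=str) + "\n").encode("utf-8")},
        )
```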

I thought there would be one JSON object per file, since the TTL value is different for each item and the Lambda would deliver each item immediately when it is deleted by TTL.

Is it because Kinesis Firehose batches the items? (It would wait some time to collect more items before saving them to a file.) Or is there another reason? Could I estimate how many files it will save if a DynamoDB item is deleted by TTL every 5 minutes?

Thank you in advance.

Kinesis Firehose splits your data into files based on buffer size or buffer interval.

Let's say you have a buffer size of 1 MB and an interval of 1 minute. If you receive less than 1 MB within the 1-minute interval, Kinesis Firehose will still create a batch file out of the received data, even though it is less than 1 MB.

This is likely what is happening in your scenario, where little data arrives. You can adjust the buffer size and interval to your needs, e.g. increase the interval to collect more items within a single batch.
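As for the estimate: with one small item expiring roughly every 5 minutes, the size limit will never be hit, so delivery is driven by the interval alone. At the maximum 900-second interval, each file would hold about 900 / 300 = 3 objects, i.e. roughly 4 files per hour. Keep in mind, though, that DynamoDB TTL does not delete items at the exact expiry time (deletions can lag the expiry timestamp considerably), so the actual per-file counts will vary.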

You can choose a buffer size of 1–128 MiB and a buffer interval of 60–900 seconds. The condition that is satisfied first triggers data delivery to Amazon S3.
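If you want to set these values in code rather than in the console, here is a minimal boto3 sketch of creating a delivery stream with explicit buffering hints; the stream, bucket, and role names are hypothetical:

```python
# Minimal sketch of creating a Firehose delivery stream with explicit
# buffering hints via boto3. All names/ARNs below are hypothetical.
import boto3

firehose = boto3.client("firehose")

firehose.create_delivery_stream(
    DeliveryStreamName="archive-stream",  # hypothetical
    DeliveryStreamType="DirectPut",
    ExtendedS3DestinationConfiguration={
        "RoleARN": "arn:aws:iam::123456789012:role/firehose-delivery-role",  # hypothetical
        "BucketARN": "arn:aws:s3:::my-archive-bucket",  # hypothetical
        "BufferingHints": {
            "SizeInMBs": 1,            # delivery triggers at 1 MiB ...
            "IntervalInSeconds": 900,  # ... or after 900 seconds, whichever comes first
        },
    },
)
```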

From the AWS Kinesis Firehose docs: https://docs.aws.amazon.com/firehose/latest/dev/create-configure.html
