How is it possible to aggregate messages from a Kafka topic based on duration (e.g. 1h)?
We are streaming messages to a Kafka topic at a rate of a few hundred per second. Each message has a timestamp and a payload. Ultimately, we would like to aggregate one hour's worth of data - based on the timestamp of the message - into parquet files and upload them to cheap remote storage (an object store).
A naive approach would be to have the consumer simply read the messages from the topic and do the aggregation/roll-up in memory, and once there is one hour's worth of data, generate and upload the parquet file.
However, if the consumer crashes or needs to be restarted, we would lose all data since the beginning of the current hour - whether we use `enable.auto.commit=true`, or `enable.auto.commit=false` and manually commit after each batch of messages.
A simple solution for the consumer could be to keep reading until one hour's worth of data is in memory, do the parquet file generation (and upload it), and only then call `commitAsync()` or `commitSync()` (using `enable.auto.commit=false` and an external store to keep track of the offsets).
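Stripped of the actual Kafka plumbing (polling, parquet writing, offset commits are stubbed as comments), the buffer-one-hour-then-flush decision described above can be sketched like this; all names here are hypothetical, not from any library:

```java
import java.time.Instant;
import java.time.temporal.ChronoUnit;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Sketch of the roll-up logic: buffer payloads per hour bucket and flush a
// bucket (write parquet, upload, commit offsets) once a later hour begins.
public class HourlyRollup {

    // In-memory buffers keyed by the start-of-hour timestamp (epoch millis).
    static Map<Long, List<String>> buffer = new LinkedHashMap<>();
    // Hour buckets that have been flushed, for illustration/inspection.
    static List<Long> flushed = new ArrayList<>();

    // Truncate an epoch-millis event timestamp to the start of its hour.
    static long hourBucket(long epochMillis) {
        return Instant.ofEpochMilli(epochMillis)
                .truncatedTo(ChronoUnit.HOURS)
                .toEpochMilli();
    }

    // Called for each consumed record (timestamp + payload).
    static void onMessage(long timestamp, String payload) {
        long bucket = hourBucket(timestamp);
        // A message from a later hour means every earlier bucket is complete:
        // at this point you would generate and upload the parquet file for it,
        // and only then commit the corresponding offsets.
        buffer.keySet().removeIf(b -> {
            if (b < bucket) {
                flushed.add(b);
                return true;
            }
            return false;
        });
        buffer.computeIfAbsent(bucket, k -> new ArrayList<>()).add(payload);
    }
}
```

Note a simplification: this only flushes when a message from a newer hour arrives; a real consumer would also need a timer (or wall-clock check on each `poll()`) so an idle topic still gets its last hour written out.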
But this would lead to millions of messages not being committed for at least one hour. I am wondering whether Kafka even allows "delaying" the commit for so many messages / such a long time (I seem to remember having read about this somewhere, but for the life of me I cannot find it again).
Actual questions:
a) Is there a limit to the number of messages (or a duration) that can remain uncommitted before Kafka considers the consumer broken or stops delivering additional messages to it? This seems counter-intuitive, though, since otherwise what would be the purpose of `enable.auto.commit=false` and managing the offsets in the consumer (e.g. with the help of an external database)?
b) In terms of robustness/redundancy and scalability, it would be great to have more than one consumer in the consumer group; if I understand correctly, it is never possible to have more than one consumer per partition. If we then run more than one consumer and configure multiple partitions per topic, we cannot do this kind of aggregation/roll-up, since messages will now be distributed across consumers. The only way to work around this issue would be to have additional (external) temporary storage for all the messages belonging to such a one-hour group, correct?
You can configure Kafka Streams with a `TimestampExtractor` to aggregate data into different types of time windows.

> into parquet files and upload them to a cheap remote storage (object-store).
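The extractor supplies the event time of each record, and a tumbling window then assigns the record to a window by flooring that timestamp to the window size. Stripped of the Kafka Streams API, the assignment itself reduces to this (a plain-Java illustration, not the Streams implementation):

```java
// Event-time tumbling-window assignment: a record belongs to the window
// starting at its event timestamp floored to the window size (one hour here).
public class WindowAssign {
    static final long WINDOW_MS = 60 * 60 * 1000L;

    static long windowStart(long eventTimeMs) {
        return eventTimeMs - (eventTimeMs % WINDOW_MS);
    }
}
```

Because the window is derived from the record's own timestamp (not arrival time), restarts and late-arriving records still land in the correct hour.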
The Kafka Connect S3 sink, or the Pinterest Secor tool, already does this.
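For illustration, a Confluent S3 sink configured to write hourly parquet files partitioned by the record timestamp might look roughly like this (bucket, region, and topic names are placeholders; check the exact property names against your connector version's documentation, and note that `ParquetFormat` requires schema-bearing data such as Avro):

```properties
name=parquet-hourly-sink
connector.class=io.confluent.connect.s3.S3SinkConnector
topics=events
s3.bucket.name=my-archive-bucket
s3.region=us-east-1
storage.class=io.confluent.connect.s3.storage.S3Storage
format.class=io.confluent.connect.s3.format.parquet.ParquetFormat
# One file (partition directory) per hour, based on the record's timestamp.
partitioner.class=io.confluent.connect.storage.partitioner.TimeBasedPartitioner
partition.duration.ms=3600000
path.format='year'=YYYY/'month'=MM/'day'=dd/'hour'=HH
timestamp.extractor=Record
locale=en-US
timezone=UTC
flush.size=100000
rotate.schedule.interval.ms=3600000
```

The connector also handles the offset-commit concern from the question: it commits offsets only after a file has been written to S3, so a crash replays at most the current uncommitted chunk.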