
How does Kafka Streams ensure that all related data is processed when consuming topics with multiple partitions?

I would like to know how Kafka Streams instances are assigned to the partitions of the topics they read. As far as I understand it, each Kafka Streams thread is a consumer (and there is one consumer group for the whole streams application). So I guess the consumers are randomly assigned to the partitions.

But how does it work if I have multiple input topics that I want to join?

Example:

Topic P contains persons. It has two partitions. The key of each message is the person-id, so all messages belonging to the same person always end up in the same partition.

Topic O contains orders. It has two partitions. Let's say the key is also the person-id (of the person who ordered something). So here, too, every order message belonging to the same person always ends up in the same partition.

Now I have a stream application that reads from both topics, counts all orders per person, and writes the result to another topic (where the message also includes the name of the person).

Data in topic P:

Partition 1: "hans, id=1", "maria, id=3"

Partition 2: "john, id=2"

Data in topic O:

Partition 1: "person-id=2, pizza", "person-id=3, cola"

Partition 2: "person-id=1, lasagne"

And now I start two streams.

Then this could happen:

Stream 1 is assigned to topic P partition 1 and topic O partition 1.

Stream 2 is assigned to topic P partition 2 and topic O partition 2.

This means that the order lasagne for hans would never get counted, because for that a single stream would need to consume both topic P partition 1 and topic O partition 2.
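The feared mismatch can be sketched in a few lines of Python. This is a simplified, hypothetical model of the partition layout from the example above, not the actual Kafka Streams assignor:

```python
# Hypothetical partition layout from the example above:
# each partition is a list of (key, value) records.
topic_p = {1: [("1", "hans"), ("3", "maria")], 2: [("2", "john")]}
topic_o = {1: [("2", "pizza"), ("3", "cola")], 2: [("1", "lasagne")]}

def count_orders(p_partition, o_partition):
    """Join the person partition with the order partition and
    count orders per person name (a toy stand-in for the join)."""
    persons = dict(topic_p[p_partition])  # person-id -> name
    counts = {}
    for person_id, _order in topic_o[o_partition]:
        if person_id in persons:          # join on person-id
            name = persons[person_id]
            counts[name] = counts.get(name, 0) + 1
    return counts

# Feared assignment: stream 1 gets partition 1 of BOTH topics,
# stream 2 gets partition 2 of both.
stream1 = count_orders(1, 1)  # sees hans, maria / pizza, cola
stream2 = count_orders(2, 2)  # sees john / lasagne

print(stream1)  # {'maria': 1}
print(stream2)  # {}
```

With this layout, hans's lasagne (and john's pizza) is counted by neither stream, because the person record and the order record sit in different partition numbers.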

So how do I handle that problem? I guess it's fairly common that streams need to process data that relates to each other, so it must be ensured that related data (here: hans and lasagne) is processed by the same stream.

I know this problem does not occur if there is only one stream, or if the topics only have one partition. But I want to be able to process messages concurrently.

Thanks

Your use case is a KStream-KTable join, where the KTable stores the user info and the KStream is the stream of orders. The two topics have to be co-partitioned: they must have the same number of partitions and be partitioned by the same key using the same Partitioner. If you use person-id as the key for your Kafka messages in both topics, and both producers use the same Partitioner, you don't need to worry about this case: related records land on the same partition number, and Kafka Streams assigns the same partition number of both co-partitioned topics to the same task.
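A minimal sketch of why co-partitioning gives this guarantee. Note the hash below is a stand-in: Kafka's default partitioner actually uses murmur2 on the serialized key, but any deterministic hash makes the same point:

```python
import hashlib

NUM_PARTITIONS = 2  # must be identical for both topics

def partition_for(key: str, num_partitions: int = NUM_PARTITIONS) -> int:
    """Stand-in for Kafka's default partitioner (which uses murmur2):
    deterministically map a key to a partition number."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions

# With person-id as the key in BOTH topics, a person record and all of
# that person's orders are guaranteed to land on the same partition
# number, because the same key goes through the same function.
for person_id in ["1", "2", "3"]:
    p_part = partition_for(person_id)  # partition in topic P
    o_part = partition_for(person_id)  # partition in topic O
    assert p_part == o_part
```

Kafka Streams then assigns partition N of topic P and partition N of topic O to the same task, so every join always sees both sides of the related data, no matter how many threads or instances you run.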

Update: as Matthias pointed out, each stream thread has its own consumer instance.
