
Can a Kafka consumer group running on different machines receive unique messages?

To avoid redundant messages when a consumer crashes and comes back up, I have disabled auto-commit of offsets and am committing them manually.

Now the question is: if the same topic is accessed by consumer processes on different machines, will they receive unique messages? Looked at theoretically, manual committing will result in redundant messages being received on different machines.

On my local machine I ran two instances of a Java consumer subscribed to the same topic, and they received repeated messages. How do I tackle this? I am using the high-level consumer.
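For reference, when multiple consumer instances share the same group.id, Kafka assigns each partition to at most one member of the group, so each message is delivered to only one instance; two instances configured with different group IDs each receive every message. Below is a minimal sketch of a manually committing consumer, assuming the newer Java KafkaConsumer API (the broker address, topic, and group name are placeholders):

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class ManualCommitConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
        // Both instances must use the SAME group.id so partitions are split
        // between them; with different group IDs, every instance gets every message.
        props.put("group.id", "my-consumer-group");       // hypothetical group name
        props.put("enable.auto.commit", "false");         // commit offsets manually
        props.put("key.deserializer",
                  "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
                  "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("my-topic")); // hypothetical topic
            while (true) {
                ConsumerRecords<String, String> records =
                        consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    process(record); // your processing logic
                }
                // Commit only after processing succeeds: at-least-once delivery.
                consumer.commitSync();
            }
        }
    }

    private static void process(ConsumerRecord<String, String> record) {
        System.out.printf("partition=%d offset=%d value=%s%n",
                record.partition(), record.offset(), record.value());
    }
}
```

Note that a group can only split work across partitions: with a single-partition topic, the second instance in the group sits idle rather than sharing messages.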

Since Kafka's message delivery semantics are at-least-once, you should implement your own code to guarantee exactly-once semantics in Kafka.

  • At most once: Messages may be lost but are never redelivered.
  • At least once: Messages are never lost but may be redelivered.
  • Exactly once: This is what people actually want; each message is delivered once and only once.

From 4.6 Message Delivery Semantics in the Kafka documentation:

So what about exactly once semantics (ie the thing you actually want)? The limitation here is not actually a feature of the messaging system but rather the need to co-ordinate the consumer's position with what is actually stored as output. The classic way of achieving this would be to introduce a two-phase commit between the storage for the consumer position and the storage of the consumers output. But this can be handled more simply and generally by simply letting the consumer store its offset in the same place as its output. This is better because many of the output systems a consumer might want to write to will not support a two-phase commit. As an example of this, our Hadoop ETL that populates data in HDFS stores its offsets in HDFS with the data it reads so that it is guaranteed that either data and offsets are both updated or neither is. We follow similar patterns for many other data systems which require these stronger semantics and for which the messages do not have a primary key to allow for deduplication.
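To illustrate the pattern the documentation describes, here is a sketch of a consumer that keeps its offsets in the same database as its output, so that a single transaction covers both. It assumes a recent Java KafkaConsumer client and a PostgreSQL database; the JDBC URL, table names, topic, and group name are all placeholders, and the old high-level consumer mentioned in the question would need equivalent logic rather than this exact API:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.time.Duration;
import java.util.Collection;
import java.util.Collections;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerRebalanceListener;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;

public class OffsetsWithOutputConsumer {

    public static void main(String[] args) throws SQLException {
        // Assumed schema:
        //   CREATE TABLE output  (value TEXT);
        //   CREATE TABLE offsets (topic TEXT, part INT, next_offset BIGINT,
        //                         PRIMARY KEY (topic, part));
        Connection db = DriverManager.getConnection("jdbc:postgresql://localhost/app"); // assumed URL
        db.setAutoCommit(false);

        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "etl-group");        // hypothetical group name
        props.put("enable.auto.commit", "false");  // offsets live in the DB, not in Kafka
        props.put("key.deserializer",
                  "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
                  "org.apache.kafka.common.serialization.StringDeserializer");

        KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
        consumer.subscribe(Collections.singletonList("my-topic"), new ConsumerRebalanceListener() {
            public void onPartitionsAssigned(Collection<TopicPartition> parts) {
                // Resume each partition from the offset recorded alongside the output.
                for (TopicPartition tp : parts) consumer.seek(tp, loadOffset(db, tp));
            }
            public void onPartitionsRevoked(Collection<TopicPartition> parts) { }
        });

        while (true) {
            ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
            for (ConsumerRecord<String, String> rec : records) {
                try {
                    writeOutput(db, rec.value());
                    saveOffset(db, new TopicPartition(rec.topic(), rec.partition()),
                               rec.offset() + 1); // next offset to consume
                    db.commit();                  // output and offset land atomically
                } catch (SQLException e) {
                    db.rollback();                // neither output nor offset is stored
                    throw e;
                }
            }
        }
    }

    static long loadOffset(Connection db, TopicPartition tp) {
        try (PreparedStatement st = db.prepareStatement(
                "SELECT next_offset FROM offsets WHERE topic = ? AND part = ?")) {
            st.setString(1, tp.topic());
            st.setInt(2, tp.partition());
            ResultSet rs = st.executeQuery();
            return rs.next() ? rs.getLong(1) : 0L; // start from the beginning if unknown
        } catch (SQLException e) {
            throw new RuntimeException(e);
        }
    }

    static void writeOutput(Connection db, String value) throws SQLException {
        try (PreparedStatement st = db.prepareStatement(
                "INSERT INTO output (value) VALUES (?)")) {
            st.setString(1, value);
            st.executeUpdate();
        }
    }

    static void saveOffset(Connection db, TopicPartition tp, long nextOffset) throws SQLException {
        try (PreparedStatement st = db.prepareStatement(
                "INSERT INTO offsets (topic, part, next_offset) VALUES (?, ?, ?) " +
                "ON CONFLICT (topic, part) DO UPDATE SET next_offset = EXCLUDED.next_offset")) {
            st.setString(1, tp.topic());
            st.setInt(2, tp.partition());
            st.setLong(3, nextOffset);
            st.executeUpdate();
        }
    }
}
```

On restart or rebalance, onPartitionsAssigned seeks back to the last offset the database committed, so a crash between processing and commit replays the batch, but the rolled-back transaction means no duplicate output rows are ever stored.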

There is also a question with a similar answer in the Kafka FAQ: How do I get exactly-once messaging from Kafka?

