
Kafka consumer - what's the relation of consumer processes and threads with topic partitions

I have been working with Kafka lately and am a bit confused about consumers within a consumer group. The core of the confusion is whether to implement consumers as processes or as threads. For this question, assume I am using the high-level consumer.

Consider a scenario I have experimented with. My topic has 2 partitions (for simplicity, assume the replication factor is 1). I created a consumer (ConsumerConnector) process consumer1 with group group1, then created a topic count map that requests 2 streams for the topic, and then spawned 2 consumer threads, consumer1_thread1 and consumer1_thread2, under that process. It looks like consumer1_thread1 is consuming partition 0 and consumer1_thread2 is consuming partition 1. Is this behaviour always deterministic? Below is the code snippet. Class TestConsumer is my consumer thread class.

    ...
    // Ask the high-level consumer for 2 streams on this topic, one per thread.
    Map<String, Integer> topicCountMap = new HashMap<String, Integer>();
    topicCountMap.put(topic, Integer.valueOf(2));
    Map<String, List<KafkaStream<byte[], byte[]>>> consumerMap = consumer.createMessageStreams(topicCountMap);
    List<KafkaStream<byte[], byte[]>> streams = consumerMap.get(topic);

    // One worker thread per stream.
    executor = Executors.newFixedThreadPool(2);

    int threadNumber = 0;
    for (final KafkaStream<byte[], byte[]> stream : streams) {
        executor.submit(new TestConsumer(stream, threadNumber));
        threadNumber++;
    }
    ...

Now consider another scenario (which I haven't tried but am curious about): I start 2 consumer processes, consumer1 and consumer2, both in the same group group1, and each of them is a single-threaded process. My questions are:

  1. How will the two independent consumer processes (both in the same group) be related to the partitions in this case? How is that different from the single-process, multi-thread scenario above?

  2. In general, how are consumer threads or processes mapped to partitions in the topic?

  3. The Kafka documentation does say that each consumer in a consumer group will consume one partition. However, does that refer to a consumer thread (as in my code example above) or to independent consumer processes?

  4. Is there anything subtle I am missing about implementing consumers as processes vs. threads? Thanks in advance.

A consumer group can have multiple consumer instances running (multiple processes with the same group-id). Each partition is consumed by exactly one consumer instance in the group.

E.g., if your topic contains 2 partitions and you start a consumer group group-A with 2 consumer instances, then each of them will consume messages from one particular partition of the topic.

If you start the same 2 consumers with different group ids, group-A and group-B, then the messages from both partitions of the topic will be delivered to each of them. So in that case the consumer instance running under group-A will receive messages from both partitions of the topic, and the same is true for group-B.
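The two cases above (one shared group vs. two different groups) can be modeled with a small plain-Java sketch. It does not talk to Kafka: the class GroupDeliverySketch, its method names, and the round-robin split inside a group are assumptions made for illustration, not Kafka's actual assignment code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class GroupDeliverySketch {

    // Spread the partitions over the members of ONE group (round-robin here,
    // purely as an assumption) so each partition has exactly one owner in it.
    public static Map<String, List<Integer>> assignWithinGroup(List<String> members, int partitionCount) {
        Map<String, List<Integer>> owned = new LinkedHashMap<>();
        for (String member : members) {
            owned.put(member, new ArrayList<>());
        }
        for (int partition = 0; partition < partitionCount; partition++) {
            String owner = members.get(partition % members.size());
            owned.get(owner).add(partition);
        }
        return owned;
    }

    // Each group is assigned independently of the others,
    // so every group as a whole sees all partitions.
    public static Map<String, Map<String, List<Integer>>> assignAll(
            Map<String, List<String>> groups, int partitionCount) {
        Map<String, Map<String, List<Integer>>> result = new LinkedHashMap<>();
        for (Map.Entry<String, List<String>> group : groups.entrySet()) {
            result.put(group.getKey(), assignWithinGroup(group.getValue(), partitionCount));
        }
        return result;
    }

    public static void main(String[] args) {
        // Two consumers in the same group: the 2 partitions are split between them.
        Map<String, List<String>> sameGroup = new LinkedHashMap<>();
        sameGroup.put("group-A", Arrays.asList("consumer1", "consumer2"));
        System.out.println(assignAll(sameGroup, 2));
        // prints {group-A={consumer1=[0], consumer2=[1]}}

        // Same two consumers in different groups: each one reads both partitions.
        Map<String, List<String>> twoGroups = new LinkedHashMap<>();
        twoGroups.put("group-A", Arrays.asList("consumer1"));
        twoGroups.put("group-B", Arrays.asList("consumer2"));
        System.out.println(assignAll(twoGroups, 2));
        // prints {group-A={consumer1=[0, 1]}, group-B={consumer2=[0, 1]}}
    }
}
```

In this model the group id alone decides whether two consumers share the work or each receive everything.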

Read more on this in their documentation.

EDIT: Based on your comment, which says:

I was wondering what the effective difference is between having 2 consumer threads under the same process as opposed to 2 consumer processes (the group being the same in both cases).

The consumer group-id is global across the cluster. Suppose you have started process-one with 2 threads and then spawn another process (maybe on a different machine) with the same groupId and 2 more threads; Kafka will then add these 2 new threads to the set consuming messages from the topic. So eventually there will be 4 threads responsible for consuming from the same topic. Kafka will then trigger a re-balance to re-assign partitions to threads, so a partition that was being consumed by thread T1 of process P1 may be reassigned to thread T2 of process P2. The following lines are taken from the wiki page:

When a new process is started with the same Consumer Group name, Kafka will add that processes' threads to the set of threads available to consume the Topic and trigger a 're-balance'. During this re-balance Kafka will assign available partitions to available threads, possibly moving a partition to another process. If you have a mixture of old and new business logic, it is possible that some messages go to the old logic.
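A rough model of such a re-balance, assuming a range-style strategy that sorts the thread ids and hands each thread a contiguous block of partitions. The thread ids (P1-T1 and so on), the class RebalanceSketch, and the choice of 4 partitions are illustrative assumptions; the real id format and assignment strategy depend on the Kafka version and configuration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class RebalanceSketch {

    // Range-style split: sort the thread ids, then give each thread a
    // contiguous block of partitions. The first (partitions % threads)
    // threads receive one extra partition.
    public static Map<String, List<Integer>> assign(List<String> threadIds, int partitionCount) {
        if (threadIds.isEmpty()) {
            throw new IllegalArgumentException("at least one consumer thread is required");
        }
        List<String> sorted = new ArrayList<>(threadIds);
        Collections.sort(sorted);
        int perThread = partitionCount / sorted.size();
        int extra = partitionCount % sorted.size();

        Map<String, List<Integer>> assignment = new TreeMap<>();
        int next = 0;
        for (int i = 0; i < sorted.size(); i++) {
            int count = perThread + (i < extra ? 1 : 0);
            List<Integer> owned = new ArrayList<>();
            for (int j = 0; j < count; j++) {
                owned.add(next++);
            }
            assignment.put(sorted.get(i), owned);
        }
        return assignment;
    }

    public static void main(String[] args) {
        // Process P1 alone, with 2 threads, on a 4-partition topic.
        System.out.println(assign(Arrays.asList("P1-T1", "P1-T2"), 4));
        // prints {P1-T1=[0, 1], P1-T2=[2, 3]}

        // Process P2 joins the same group with 2 more threads: re-balance.
        System.out.println(assign(Arrays.asList("P1-T1", "P1-T2", "P2-T1", "P2-T2"), 4));
        // prints {P1-T1=[0], P1-T2=[1], P2-T1=[2], P2-T2=[3]}
    }
}
```

In this model the result is a pure function of the sorted thread ids and the partition count, so the same membership always yields the same mapping, while any change in membership can move partitions between threads and between processes (here partitions 2 and 3 move from P1 to P2).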

The main design consideration when choosing between multiple consumer instances with the same group id and a single consumer instance is resiliency. For example, if you have a single consumer with two threads and its machine goes down, you lose all consumers. If you have two separate consumer instances with the same group id, each on a different host, they can survive a failure. Ideally each consumer instance should have two threads in the scenario above, so that if one host goes down the other instance uses its dormant thread to take over the other partition. Indeed, it is desirable to have more threads than partitions to cover this case.

  1. You can run each consumer instance on a different host. A single consumer instance for a given group name/id only ever runs on a single host, as it manages all its threads in a single runtime environment.
  2. Kafka has an algorithm to determine which threads/consumer instances read the various topic partitions. Kafka tries to distribute these evenly in a resilient fashion. When a consumer instance fails, threads in the other instances can take over the given partition.
  3. It refers to a single thread in the consumer group. If there are more threads than partitions, some of them will remain dormant until other threads fail, which provides resiliency.
  4. The preference relates to resilience. With multiple consumer instances set up with the same group id, I can run on multiple hosts, making my application tolerant to failure.

Thanks for the detailed answer from @user2720864, but I think the re-allocation case mentioned in that answer is not correct: one partition cannot be consumed by two consumers.

When there are more consumers than partitions, each partition is allocated exclusively to one consumer, while the leftover consumers stay idle until a working consumer dies or is removed from the group.

Based on the Kafka Consumers documentation:

The way consumption is implemented in Kafka is by dividing up the partitions in the log over the consumer instances so that each instance is the exclusive consumer of a "fair share" of partitions at any point in time. This process of maintaining membership in the group is handled by the Kafka protocol dynamically. If new instances join the group they will take over some partitions from other members of the group; if an instance dies, its partitions will be distributed to the remaining instances.

And also its API specification, in the "Consumer Groups and Topic Subscriptions" section:

This is achieved by balancing the partitions between all members in the consumer group so that each partition is assigned to exactly one consumer in the group.
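The exclusive-owner and idle-consumer behaviour described above can be illustrated with a toy membership model. Everything here is hypothetical (the class StandbySketch, the consumer names c1 to c3, and the simple sorted, index-based assignment); it only mimics the rule that each partition has exactly one owner and that spare consumers take over when a member leaves.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.TreeSet;

public class StandbySketch {
    private final TreeSet<String> members = new TreeSet<>();
    private final int partitionCount;

    public StandbySketch(int partitionCount) {
        this.partitionCount = partitionCount;
    }

    public void join(String consumer) {
        members.add(consumer);
    }

    public void leave(String consumer) {
        members.remove(consumer);
    }

    // Partition -> its single owner, recomputed from the current membership.
    public Map<Integer, String> owners() {
        List<String> ordered = new ArrayList<>(members);
        Map<Integer, String> result = new TreeMap<>();
        if (ordered.isEmpty()) {
            return result;
        }
        for (int partition = 0; partition < partitionCount; partition++) {
            result.put(partition, ordered.get(partition % ordered.size()));
        }
        return result;
    }

    // Members that currently own no partition at all.
    public List<String> idle() {
        List<String> spare = new ArrayList<>(members);
        spare.removeAll(owners().values());
        return spare;
    }

    public static void main(String[] args) {
        StandbySketch group = new StandbySketch(2); // topic with 2 partitions
        group.join("c1");
        group.join("c2");
        group.join("c3");
        System.out.println(group.owners()); // prints {0=c1, 1=c2}
        System.out.println(group.idle());   // prints [c3]

        group.leave("c1");                  // a working consumer dies
        System.out.println(group.owners()); // prints {0=c2, 1=c3}
        System.out.println(group.idle());   // prints []
    }
}
```

At no point does a partition have two owners in this model; the third consumer only starts working once membership shrinks.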


 