
How to run hundreds of Kafka consumers on the same machine?

Original post: 2019-07-06 14:37:00 | Tags: java, multithreading, apache-kafka

The Kafka docs mention that consumers are not thread-safe. To avoid this problem, I read that it is a good idea to run one consumer per Java process. How can this be achieved?

The number of consumers is not fixed; it can change according to need.

Thanks, Alessio

1 answer

You're right that the documentation specifies that Kafka consumers are not thread-safe. However, it also says that you should run consumers on separate threads, not processes. That's quite different. See here for an answer with more specifics, geared towards Java/JVM: https://stackoverflow.com/a/15795159/236528
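To make "not thread-safe" concrete: the KafkaConsumer Javadoc states that unsynchronized access from multiple threads results in a ConcurrentModificationException. The sketch below is only a simplified stand-in for that kind of guard, not the library's code; the class name SingleThreadGuardSketch and its call method are hypothetical, and unlike a real consumer it does not allow re-entrant calls from the owning thread.

```java
import java.util.ConcurrentModificationException;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical stand-in for an object that tolerates only one calling thread at a time.
final class SingleThreadGuardSketch {

    // Thread currently inside a guarded call, or null when the object is free.
    private final AtomicReference<Thread> owner = new AtomicReference<>(null);

    // Runs `action` as if it were a consumer call such as poll().
    // A second thread arriving while a call is in progress is rejected, not queued.
    void call(Runnable action) {
        if (!owner.compareAndSet(null, Thread.currentThread())) {
            throw new ConcurrentModificationException(
                    "guarded object is not safe for multi-threaded access");
        }
        try {
            action.run();
        } finally {
            owner.set(null); // release so the next call can proceed
        }
    }
}
```

Giving each thread its own consumer instance means a guard like this never trips, which is why separate threads (each with a private consumer) are enough and separate processes are not required.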

In general, you can have as many consumers as you want on a Kafka topic. Some of these might share a group id, in which case all the partitions for that topic will be distributed across the consumers in that group that are active at any point in time.

There's much more detail in the Javadoc for the Kafka consumer, linked at the bottom of this answer, but I have copied below the two thread/consumer models suggested by the documentation.

1. One Consumer Per Thread

A simple option is to give each thread its own consumer instance. Here are the pros and cons of this approach:

PRO: It is the easiest to implement.

PRO: It is often the fastest, as no inter-thread co-ordination is needed.

PRO: It makes in-order processing on a per-partition basis very easy to implement (each thread just processes messages in the order it receives them).

CON: More consumers means more TCP connections to the cluster (one per thread). In general Kafka handles connections very efficiently, so this is generally a small cost.

CON: Multiple consumers means more requests being sent to the server and slightly less batching of data, which can cause some drop in I/O throughput.

CON: The number of total threads across all processes will be limited by the total number of partitions.
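The last CON above can be illustrated with a small simulation. This is a minimal sketch, assuming a simplified round-robin deal of partitions to the consumers of one group; it is not Kafka's actual assignment strategy, and PartitionAssignmentSketch, assign, and idleConsumers are hypothetical names.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Simplified stand-in for group partition assignment (NOT Kafka's real assignor).
final class PartitionAssignmentSketch {

    // Deals partitions 0..partitionCount-1 out to the given consumers round-robin.
    // Consumer ids are assumed to be unique.
    static Map<String, List<Integer>> assign(int partitionCount, List<String> consumerIds) {
        Map<String, List<Integer>> assignment = new LinkedHashMap<>();
        for (String id : consumerIds) {
            assignment.put(id, new ArrayList<>());
        }
        if (consumerIds.isEmpty()) {
            return assignment;
        }
        for (int p = 0; p < partitionCount; p++) {
            String owner = consumerIds.get(p % consumerIds.size());
            assignment.get(owner).add(p);
        }
        return assignment;
    }

    // Counts consumers that received no partition and would therefore sit idle.
    static long idleConsumers(Map<String, List<Integer>> assignment) {
        return assignment.values().stream().filter(List::isEmpty).count();
    }

    public static void main(String[] args) {
        List<String> consumers = List.of("c0", "c1", "c2", "c3", "c4", "c5");
        Map<String, List<Integer>> assignment = assign(4, consumers);
        assignment.forEach((id, parts) -> System.out.println(id + " -> " + parts));
        System.out.println("idle consumers: " + idleConsumers(assignment));
    }
}
```

With 4 partitions and 6 consumers in the same group, two consumers end up with an empty assignment, so consumer threads beyond the partition count add no throughput.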

2. Decouple Consumption and Processing

Another alternative is to have one or more consumer threads that do all data consumption and hand off ConsumerRecords instances to a blocking queue consumed by a pool of processor threads that actually handle the record processing. This option likewise has pros and cons:

PRO: This option allows independently scaling the number of consumers and processors. This makes it possible to have a single consumer that feeds many processor threads, avoiding any limitation on partitions.

CON: Guaranteeing order across the processors requires particular care: because the threads execute independently, an earlier chunk of data may actually be processed after a later chunk of data, purely due to thread execution timing. For processing that has no ordering requirements this is not a problem.

CON: Manually committing the position becomes harder, as it requires that all threads co-ordinate to ensure that processing is complete for that partition.

There are many possible variations on this approach. For example, each processor thread can have its own queue, and the consumer threads can hash into these queues using the TopicPartition to ensure in-order consumption and simplify commit.
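The per-thread-queue variation can be sketched without any Kafka dependency. The code below is an illustration under assumptions, not the documentation's implementation: Rec is a hypothetical stand-in for ConsumerRecord, the list passed to run plays the role of records returned by the consumer thread's poll loop, and PartitionRoutingSketch, workerFor, and run are made-up names. Offset commits are left out.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Objects;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

final class PartitionRoutingSketch {

    // Hypothetical stand-in for a consumed record (real code would use ConsumerRecord).
    static final class Rec {
        final String topic;
        final int partition;
        final long offset;

        Rec(String topic, int partition, long offset) {
            this.topic = topic;
            this.partition = partition;
            this.offset = offset;
        }
    }

    // Sentinel telling a worker to stop; compared by reference.
    private static final Rec POISON = new Rec("", -1, -1L);

    // All records of one topic-partition map to the same worker index.
    static int workerFor(Rec r, int workerCount) {
        return Math.floorMod(Objects.hash(r.topic, r.partition), workerCount);
    }

    // Hands records to per-worker queues and returns them in the order they were processed.
    static List<Rec> run(List<Rec> polled, int workerCount) {
        List<BlockingQueue<Rec>> queues = new ArrayList<>();
        List<Thread> workers = new ArrayList<>();
        List<Rec> processed = Collections.synchronizedList(new ArrayList<>());
        for (int i = 0; i < workerCount; i++) {
            BlockingQueue<Rec> queue = new LinkedBlockingQueue<>();
            queues.add(queue);
            Thread worker = new Thread(() -> {
                try {
                    for (Rec r = queue.take(); r != POISON; r = queue.take()) {
                        processed.add(r); // placeholder for the real record processing
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
            workers.add(worker);
            worker.start();
        }
        try {
            // The calling thread acts as the single consumer thread doing the hand-off.
            for (Rec r : polled) {
                queues.get(workerFor(r, workerCount)).put(r);
            }
            for (BlockingQueue<Rec> queue : queues) {
                queue.put(POISON);
            }
            for (Thread worker : workers) {
                worker.join();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while handing off records", e);
        }
        return processed;
    }
}
```

Because a partition's records always land in one queue drained by one thread, per-partition order is preserved even though different partitions are processed concurrently. The trade-off is that a single hot partition cannot be spread over several workers.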

In my experience, option #1 is the best for starting out, and you can upgrade to option #2 only if you really need it. Option #2 is the only way to extract the maximum performance from the Kafka consumer, but its implementation is more complex. So, give option #1 a try first, and see if it's good enough for your specific use case.

The full Javadoc is available at this link: https://kafka.apache.org/23/javadoc/index.html?org/apache/kafka/clients/consumer/KafkaConsumer.html