
How Does a Kafka Consumer Consume from Multiple Assigned Partitions?

tl;dr: I am trying to understand how a single consumer that is assigned multiple partitions handles consuming records from each partition.

For example:

  • Completely process a single partition before moving to the next.
  • Process a chunk of available records from each partition every time.
  • Process a batch of N records from the first available partitions.
  • Process a batch of N records from partitions in round-robin rotation.

I found the partition.assignment.strategy configuration for the Range and RoundRobin assignors, but this only determines how partitions are assigned to consumers, not how a consumer consumes from the partitions it has been assigned.
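For reference, the assignor is chosen through consumer configuration; a minimal sketch (the group id is a placeholder) might look like:

```java
import java.util.Properties;

// Consumer configuration controlling how partitions are *assigned* across
// consumers in the group -- not how one consumer fetches from its partitions.
Properties props = new Properties();
props.put("bootstrap.servers", "localhost:9092");
props.put("group.id", "dead-letter-reprocessor");  // placeholder group id
props.put("partition.assignment.strategy",
          "org.apache.kafka.clients.consumer.RoundRobinAssignor");
```

`RangeAssignor` is the alternative (and historical default) value for the same property.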

I started digging into the KafkaConsumer source: #poll() led me to #pollForFetches(), which in turn led me to Fetcher#fetchedRecords() and Fetcher#sendFetches().

From there I tried to follow the entire Fetcher class, and maybe it is just late or maybe I just didn't dig in far enough, but I am having trouble untangling exactly how a consumer will process multiple assigned partitions.

Background

Working on a data pipeline backed by Kafka Streams.

At several stages in this pipeline, as records are processed by different Kafka Streams applications, the stream is joined to compacted topics fed by external data sources. These topics provide the data used to augment the records before they continue to the next stage of processing.

Along the way there are several dead letter topics for records that could not be matched to the external data sources that would have augmented them. This could be because the data is just not available yet (the Event or Campaign is not live yet), or because it is bad data and will never match.

The goal is to republish records from the dead letter topic whenever new augmented data is published, so that we can match previously unmatched records from the dead letter topic, update them, and send them downstream for additional processing.

Records may have failed to match on several attempts and could have multiple copies in the dead letter topic, so we only want to reprocess existing records (those before the latest offset at the time the application starts) as well as records sent to the dead letter topic since the last time the application ran (those after the previously saved consumer group offsets).

It works well: my consumer filters out any records arriving after the application has started, and my producer manages my consumer group offsets by committing the offsets as part of the publishing transaction.
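The offset-committing-producer pattern described above can be sketched roughly as follows. This is an illustrative outline, not the actual pipeline code; the topic name, group id, and the surrounding setup are assumptions:

```java
// Sketch: publish records and commit the consumer's offsets atomically
// in one producer transaction ("dead-letter-reprocessor" is a placeholder
// group id, "output-topic" a placeholder topic).
producer.beginTransaction();
try {
    for (ConsumerRecord<String, String> record : records) {
        producer.send(new ProducerRecord<>("output-topic",
                                           record.key(), record.value()));
    }
    // Attach consumed offsets to the transaction so they commit (or abort)
    // together with the published records.
    producer.sendOffsetsToTransaction(offsetsToCommit, "dead-letter-reprocessor");
    producer.commitTransaction();
} catch (KafkaException e) {
    producer.abortTransaction();  // offsets and records roll back together
}
```

If the transaction aborts, neither the output records nor the offset commit become visible, which is what makes this safe for exactly-once style reprocessing.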

But I want to make sure that I will eventually consume from all partitions, as I have run into an odd edge case where unmatched records get reprocessed and land in the same partition as before in the dead letter topic, only to get filtered out by the consumer. And though the application is not getting new batches of records to process, there are partitions that have not been reprocessed yet either.

Any help understanding how a single consumer processes multiple assigned partitions would be greatly appreciated.

You were on the right track looking at Fetcher, as most of the logic is there.

First, as the Consumer Javadoc mentions:

If a consumer is assigned multiple partitions to fetch data from, it will try to consume from all of them at the same time, effectively giving these partitions the same priority for consumption.

As you can imagine, in practice, there are a few things to take into account.

  • Each time the consumer tries to fetch new records, it excludes partitions for which it already has records awaiting processing (from a previous fetch). Partitions that already have a fetch request in flight are also excluded.

  • When fetching records, the consumer specifies fetch.max.bytes and max.partition.fetch.bytes in the fetch request. These are used by the brokers to determine, respectively, how much data to return in total and per partition. This is applied equally to all partitions.

Using these two mechanisms, by default, the Consumer tries to consume from all partitions fairly. If that's not the case, changing fetch.max.bytes or max.partition.fetch.bytes usually helps.
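As an illustration, both settings are plain consumer configuration; the values below are the current defaults, shown explicitly so you can see what to tune:

```java
import java.util.Properties;

Properties props = new Properties();
// Upper bound on the total data returned in one fetch response,
// across all partitions (default 50 MB).
props.put("fetch.max.bytes", "52428800");
// Upper bound per partition within a single fetch response (default 1 MB).
// Lowering this prevents one busy partition from dominating each fetch.
props.put("max.partition.fetch.bytes", "1048576");
```

Note both are soft limits: if the first record batch in the first fetched partition is larger than the limit, it is still returned so the consumer can make progress.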

In case you want to prioritize some partitions over others, you need to use pause() and resume() to manually control the consumption flow.
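A minimal sketch of that pattern, assuming a hypothetical "dead-letter-topic" and treating its partition 0 as the priority partition:

```java
// Sketch: drain one priority partition first by pausing the others.
// Topic name and partition number are placeholders.
Set<TopicPartition> assigned = consumer.assignment();
TopicPartition priority = new TopicPartition("dead-letter-topic", 0);

Set<TopicPartition> others = new HashSet<>(assigned);
others.remove(priority);

consumer.pause(others);   // subsequent poll() calls skip paused partitions
ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
// ... process records from the priority partition ...
consumer.resume(others);  // restore normal, fair consumption
```

Paused partitions remain assigned to the consumer (no rebalance is triggered); poll() simply stops returning records from them until resume() is called.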
