Kafka consumer group and partitions with Spark structured streaming
I have a Kafka topic with 3 partitions and I'm consuming that data using Spark Structured Streaming. I have 3 consumers (let's say consumer group A), each reading from a single partition; everything is working fine up to this point.

I have a new requirement to read from the same topic, and I want to parallelize it by again creating 3 consumers (say consumer group B), each reading from a single partition. As I'm using Structured Streaming, I can't specify group.id explicitly.

Will consumers from different groups pointing to the same partition read all the data?
From the Spark 3.0.1 documentation:

By default, each query generates a unique group id for reading data. This ensures that each Kafka source has its own consumer group that does not face interference from any other consumer, and therefore can read all of the partitions of its subscribed topics.
So, if you use the assign option and specify which partition to read, the query will read all data from that partition, because by default each query runs under its own, distinct consumer group (group.id). The assign option takes a JSON string as its value and can list multiple partitions from different topics as well, e.g. {"topicA":[0,1],"topicB":[2,4]}.
val df = spark
  .read
  .format("kafka")
  .option("kafka.bootstrap.servers", "host:port")
  .option("assign", """{"topic-name":[0]}""")
  .load()
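The value of assign is plain JSON mapping topic names to partition arrays. As a minimal sketch (no Spark required; AssignJson and its render method are hypothetical helpers, not part of any Spark API), you could build that string programmatically instead of hand-escaping it:

```scala
object AssignJson {
  // Hypothetical helper: renders a topic -> partitions mapping into the
  // JSON string format the "assign" option expects, e.g. {"topicA":[0,1]}.
  def render(assignments: Seq[(String, Seq[Int])]): String =
    assignments
      .map { case (topic, parts) => s""""$topic":[${parts.mkString(",")}]""" }
      .mkString("{", ",", "}")

  def main(args: Array[String]): Unit = {
    // Same shape as the example from the Spark docs above.
    println(render(Seq("topicA" -> Seq(0, 1), "topicB" -> Seq(2, 4))))
  }
}
```

Using a Seq of pairs (rather than a Map) keeps the topic order deterministic in the rendered string.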
You can use a group id as below for streaming (note that Kafka-specific options passed to Spark must be prefixed with kafka., so the option name is kafka.group.id):

String processingGroup = "processingGroupA";
Dataset<Row> raw_df = sparkSession
  .readStream()
  .format("kafka")
  .option("kafka.bootstrap.servers", consumerAppProperties.getProperty(BOOTSTRAP_SERVERS_CONFIG))
  .option("subscribe", topicName)
  .option("startingOffsets", "latest")
  .option("kafka.group.id", processingGroup)
  .load();
Unless you are using Spark 3.x or higher, you will not be able to set the group.id in your Kafka input stream. Using Spark 3.x you could, as you have mentioned, have two different Structured Streaming jobs providing two different group.id values to ensure that each job reads all messages of the topic independently of the other job.
For Spark versions <= 2.4.x, Spark itself will create a unique consumer group for you, as you can see in the code on GitHub:
// Each running query should use its own group id. Otherwise, the query may be only
// assigned partial data since Kafka will assign partitions to multiple consumers having
// the same group id. Hence, we should generate a unique id for each query.
val uniqueGroupId = s"spark-kafka-source-${UUID.randomUUID}-${metadataPath.hashCode}"
So, also in that case, having two different streaming jobs will ensure that you have two different consumer groups, which allows both jobs to read all messages from the topic independently of each other.
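The effect of that line can be sketched without Spark at all: every call produces a distinct id, so no two queries ever share a consumer group. A minimal standalone sketch, mirroring the format of the Spark snippet above:

```scala
import java.util.UUID

object UniqueGroupIdSketch {
  // Same pattern as Spark's internal code quoted above: a fresh UUID per
  // query means each streaming query lands in its own consumer group.
  def uniqueGroupId(metadataPath: String): String =
    s"spark-kafka-source-${UUID.randomUUID}-${metadataPath.hashCode}"

  def main(args: Array[String]): Unit = {
    val a = uniqueGroupId("/checkpoints/jobA")
    val b = uniqueGroupId("/checkpoints/jobA")
    // Distinct group ids even for the same checkpoint path, so from
    // Kafka's point of view both queries read every partition in full.
    println(a != b)
  }
}
```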