Kafka consumer group and partitions with Spark structured streaming

I have a Kafka topic with 3 partitions and I'm consuming that data using Spark Structured Streaming. I have 3 consumers (let's say consumer group A), each reading from a single partition; everything is working fine up to here.

I have a new requirement to read from the same topic, and I want to parallelize it by again creating 3 consumers (say consumer group B), each reading from a single partition. As I'm using Structured Streaming, I can't specify group.id explicitly.

Will consumers from different groups pointing to the same partition read all the data?

From the Spark 3.0.1 documentation:

By default, each query generates a unique group id for reading data. This ensures that each Kafka source has its own consumer group that does not face interference from any other consumer, and therefore can read all of the partitions of its subscribed topics.

So, if you use the assign option and specify which partition(s) to read, the query will read all data from those partitions, because by default it runs in its own consumer group (group.id). The assign option takes a JSON string as a value and can include multiple partitions from different topics as well, e.g. {"topicA":[0,1],"topicB":[2,4]}.

val df = spark
  .read
  .format("kafka")
  .option("kafka.bootstrap.servers", "host:port")
  .option("assign", "{"topic-name":[0]}")
  .load()
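
For the multi-topic form mentioned above, a minimal sketch (broker host and topic names are placeholders; a Scala triple-quoted string avoids escaping the inner JSON quotes):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("assign-example").getOrCreate()

// Reads only partitions 0 and 1 of topicA plus partitions 2 and 4 of topicB;
// by default Spark generates a unique consumer group for this query.
val multiDf = spark
  .read
  .format("kafka")
  .option("kafka.bootstrap.servers", "host:port")
  .option("assign", """{"topicA":[0,1],"topicB":[2,4]}""")
  .load()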

You can set the consumer group explicitly for streaming via the kafka.group.id option (available since Spark 3.0; Kafka consumer properties are passed through with the kafka. prefix):

String processingGroup = "processingGroupA";

Dataset<Row> raw_df = sparkSession
                      .readStream()
                      .format("kafka")
                      .option("kafka.bootstrap.servers", consumerAppProperties.getProperty(BOOTSTRAP_SERVERS_CONFIG))
                      .option("subscribe", topicName) 
                      .option("startingOffsets", "latest")
                      .option("group.id",  processingGroup)
                      .load();

Unless you are using Spark 3.x or higher, you will not be able to set the group.id in your Kafka input stream. With Spark 3.x you could, as mentioned above, have two different Structured Streaming jobs providing two different group.ids, ensuring that each job reads all messages of the topic independently of the other job.
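
A minimal Scala sketch of that setup, assuming Spark 3.x (broker address, topic, and group names are placeholders):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("two-consumer-groups").getOrCreate()

// Query A: reads the whole topic under its own consumer group.
val dfA = spark
  .readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "host:port")
  .option("subscribe", "topic-name")
  .option("kafka.group.id", "consumer-group-A")
  .load()

// Query B: same topic but an independent consumer group, so it also
// receives every message, regardless of what group A has consumed.
val dfB = spark
  .readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "host:port")
  .option("subscribe", "topic-name")
  .option("kafka.group.id", "consumer-group-B")
  .load()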

For Spark versions <= 2.4.x, Spark itself will create a unique consumer group for you, as you can look up in the code on GitHub:

// Each running query should use its own group id. Otherwise, the query may be only 
// assigned partial data since Kafka will assign partitions to multiple consumers having
// the same group id. Hence, we should generate a unique id for each query.
val uniqueGroupId = s"spark-kafka-source-${UUID.randomUUID}-${metadataPath.hashCode}"

So, also in that case, having two different streaming jobs will ensure that you have two different consumer groups, which allows both jobs to read all messages from the topic independently of each other.
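
As a sketch for Spark <= 2.4.x (broker, topic, and checkpoint paths are placeholders): no group id is, or can be, set; each started query gets its own auto-generated spark-kafka-source-... consumer group, so both read the full topic independently.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("auto-group-ids").getOrCreate()

val stream = spark
  .readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "host:port")
  .option("subscribe", "topic-name")
  .load()

// Each call to start() launches a separate query, and each query
// generates its own unique consumer group id as shown above.
val queryA = stream.writeStream
  .format("console")
  .option("checkpointLocation", "/tmp/checkpoint-A")  // hypothetical path
  .start()

val queryB = stream.writeStream
  .format("console")
  .option("checkpointLocation", "/tmp/checkpoint-B")  // hypothetical path
  .start()

spark.streams.awaitAnyTermination()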

