Spark Structured Streaming with Kafka source, change number of topic partitions while query is running

I've set up a Spark Structured Streaming query that reads from a Kafka topic. If the number of partitions in the topic is changed while the Spark query is running, Spark does not seem to notice, and data on the new partitions is not consumed.

Is there a way to tell Spark to check for new partitions in the same topic, apart from stopping the query and restarting it?
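For context, this is roughly how the partition count was being changed while the query ran; a minimal sketch using the Kafka AdminClient, where the broker address, topic name, and partition count are placeholders, not the actual values:

    import java.util.{Collections, Properties}
    import org.apache.kafka.clients.admin.{AdminClient, NewPartitions}

    // Hypothetical reproduction: grow the topic while the streaming query is running.
    // "localhost:9092", "my-topic" and 8 are placeholders.
    val props = new Properties()
    props.put("bootstrap.servers", "localhost:9092")
    val admin = AdminClient.create(props)
    admin
      .createPartitions(Collections.singletonMap("my-topic", NewPartitions.increaseTo(8)))
      .all()
      .get()
    admin.close()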

EDIT: I'm using Spark 2.4.4. I read from Kafka as follows:

// `spark` (a SparkSession), `kafkaURL` and `topic` are defined elsewhere in the application
val df = spark
  .readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", kafkaURL)
  .option("startingOffsets", "earliest")
  .option("subscribe", topic)
  .option("failOnDataLoss", value = false)
  .load()

After some processing, I write to a Delta Lake table on HDFS.
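A minimal sketch of what that write side can look like, using the Delta Lake streaming sink; the checkpoint and table paths are hypothetical placeholders:

    // Hedged sketch of the write; both paths are placeholders
    df.writeStream
      .format("delta")
      .outputMode("append")
      .option("checkpointLocation", "hdfs:///checkpoints/my-query")
      .start("hdfs:///delta/my-table")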

Answering my own question. Kafka consumers check for new partitions/topics (the latter when subscribing to topics with a pattern) every metadata.max.age.ms, whose default value is 300000 ms (5 minutes).

Since my test lasted far less than that, I didn't notice the update. For tests, reduce the value to something smaller, e.g. 100 ms, by setting the following option on the DataStreamReader:

.option("kafka.metadata.max.age.ms", 100)
