Spark Structured Streaming with Kafka source, change number of topic partitions while query is running

I've set up a Spark Structured Streaming query that reads from a Kafka topic. If the number of partitions in the topic is changed while the Spark query is running, Spark does not seem to notice, and data on the new partitions is not consumed.

Is there a way to tell Spark to check for new partitions in the same topic, apart from stopping the query and restarting it?
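For context, this is roughly how the partition count was being changed while the query ran; a minimal sketch using the Kafka AdminClient, where the broker address, topic name, and partition count are placeholders, not the actual values:

    import java.util.{Collections, Properties}
    import org.apache.kafka.clients.admin.{AdminClient, NewPartitions}

    // Hypothetical reproduction: grow the topic while the streaming query is running.
    // "localhost:9092", "my-topic" and 8 are placeholders.
    val props = new Properties()
    props.put("bootstrap.servers", "localhost:9092")
    val admin = AdminClient.create(props)
    admin
      .createPartitions(Collections.singletonMap("my-topic", NewPartitions.increaseTo(8)))
      .all()
      .get()
    admin.close()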

EDIT: I'm using Spark 2.4.4. I read from Kafka as follows:

// `spark` (a SparkSession), `kafkaURL` and `topic` are defined elsewhere in the application
val df = spark
  .readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", kafkaURL)
  .option("startingOffsets", "earliest")
  .option("subscribe", topic)
  .option("failOnDataLoss", value = false)
  .load()

After some processing, I write to a Delta Lake table on HDFS.
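A minimal sketch of what that write side can look like, using the Delta Lake streaming sink; the checkpoint and table paths are hypothetical placeholders:

    // Hedged sketch of the write; both paths are placeholders
    df.writeStream
      .format("delta")
      .outputMode("append")
      .option("checkpointLocation", "hdfs:///checkpoints/my-query")
      .start("hdfs:///delta/my-table")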

Answering my own question. Kafka consumers check for new partitions/topics (the latter when subscribing to topics with a pattern) every metadata.max.age.ms, whose default value is 300000 ms (5 minutes).

Since my test lasted far less than that, I didn't notice the update. For tests, reduce the value to something smaller, e.g. 100 ms, by setting the following option on the DataStreamReader:

.option("kafka.metadata.max.age.ms", 100)
