
Why does Spark Structured Streaming not allow changing the number of input sources?

I would like to build a Spark streaming pipeline that reads from multiple Kafka topics (which vary in number over time). I intended to stop the streaming job, add/remove topics, and start the job again whenever the set of topics needed to change, using one of the two options outlined in the Spark Structured Streaming + Kafka Integration Guide:

# Subscribe to multiple topics
df = spark \
  .readStream \
  .format("kafka") \
  .option("kafka.bootstrap.servers", "host1:port1,host2:port2") \
  .option("subscribe", "topic1,topic2") \
  .load()
df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")

# Subscribe to a pattern
df = spark \
  .readStream \
  .format("kafka") \
  .option("kafka.bootstrap.servers", "host1:port1,host2:port2") \
  .option("subscribePattern", "topic.*") \
  .load()
df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")

Upon further investigation, I noticed the following point in the Spark Structured Streaming Programming Guide and am trying to understand why changing the number of input sources is "not allowed":

Changes in the number or type (i.e. different source) of input sources: This is not allowed.

Definition of "not allowed" (also from the Spark Structured Streaming Programming Guide):

The term not allowed means you should not do the specified change as the restarted query is likely to fail with unpredictable errors. sdf represents a streaming DataFrame/Dataset generated with sparkSession.readStream.

My understanding is that Spark Structured Streaming implements its own checkpointing mechanism:

In case of a failure or intentional shutdown, you can recover the previous progress and state of a previous query, and continue where it left off. This is done using checkpointing and write-ahead logs. You can configure a query with a checkpoint location, and the query will save all the progress information (i.e. range of offsets processed in each trigger) and the running aggregates (e.g. word counts in the quick example) to the checkpoint location. This checkpoint location has to be a path in an HDFS compatible file system, and can be set as an option in the DataStreamWriter when starting a query.
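For reference, the checkpoint location the guide mentions is just an option on the DataStreamWriter. Here is a minimal Scala sketch of what that looks like; the app name, console sink, and /tmp path below are placeholders of mine, not taken from any actual job:

import org.apache.spark.sql.SparkSession

// Placeholder session; a real job would point at a cluster instead of local[*].
val spark = SparkSession.builder()
  .appName("checkpoint-example")
  .master("local[*]")
  .getOrCreate()

// The same Kafka source as above, written with the Scala API.
val df = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "host1:port1,host2:port2")
  .option("subscribe", "topic1,topic2")
  .load()

// The checkpoint location is passed to the DataStreamWriter when the query starts.
val query = df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
  .writeStream
  .format("console")
  .option("checkpointLocation", "/tmp/checkpoints/kafka-query") // use an HDFS-compatible path in production
  .start()

query.awaitTermination()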

Can someone please explain why changing the number of sources is "not allowed"? I assume that would be one of the benefits of the checkpointing mechanism.

Steps to add a new input source to an existing running streaming job (one in which a model is running):

  1. Stop the currently running streaming job in which the model is running.
  2. hdfs dfs -get output/checkpoints/<model_name>offsets <local_directory>/offsets

There will be 3 files in the directory (since the last 3 offsets are saved by Spark). A sample format for a single file is below:

v1
{"batchWatermarkMs":0,"batchTimestampMs":1578463128395,"conf":{"spark.sql.streaming.stateStore.providerClass":"org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider","spark.sql.streaming.flatMapGroupsWithState.stateFormatVersion":"2","spark.sql.streaming.multipleWatermarkPolicy":"min","spark.sql.streaming.aggregation.stateFormatVersion":"2","spark.sql.shuffle.partitions":"200"}}
{"logOffset":0}
{"logOffset":0}
  • Each {"logOffset":batchId} line represents a single input source.
  • To add a new input source, append a "-" at the end of each file in the directory.

Sample updated file:

v1
{"batchWatermarkMs":0,"batchTimestampMs":1578463128395,"conf":{"spark.sql.streaming.stateStore.providerClass":"org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider","spark.sql.streaming.flatMapGroupsWithState.stateFormatVersion":"2","spark.sql.streaming.multipleWatermarkPolicy":"min","spark.sql.streaming.aggregation.stateFormatVersion":"2","spark.sql.shuffle.partitions":"200"}}
{"logOffset":0}
{"logOffset":0}
-
  • If you want to add more than one new input source, append one "-" per new input source (a small sketch that automates this is shown after these steps).
  • hdfs dfs -put -f <local_directory>/offsets output/checkpoints/<model_name>offsets
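Instead of editing every offset file by hand, the appending can be scripted. A small Scala sketch of that idea, applied to the files fetched with hdfs dfs -get above (the local directory name and the number of new sources here are assumptions):

import java.io.{File, FileWriter}

// Assumed local copy of the checkpoint's offsets directory.
val localOffsetsDir = new File("offsets")
// Assumed number of input sources being added.
val newSources = 1

for (offsetFile <- localOffsetsDir.listFiles() if offsetFile.isFile) {
  val writer = new FileWriter(offsetFile, true) // append mode
  try {
    // One "-" line per new source, matching the manual edit described above.
    (1 to newSources).foreach(_ => writer.write("\n-"))
  } finally {
    writer.close()
  }
}

After running this, the files can be pushed back with the hdfs dfs -put -f command above.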

The best way to do what you want is to run your readStreams in multiple threads. I'm doing this, reading 40 tables at the same time. To do this I followed this article: https://cm.engineering/multiple-spark-streaming-jobs-in-a-single-emr-cluster-ca86c28d1411 .

I'll give a quick overview of what I did after reading it and setting up my code structure: a main function, an executor, and a trait holding my Spark session, which is shared with all the jobs.

  1. Two lists of the topics that I want to read.

So, in Scala I create two lists. The first list contains the topics that I always want to read, and the second is a dynamic list: when I stop my job I can add some new topics to it.

  2. Pattern matching to run the jobs.

I have two different kinds of jobs: one that I run for the tables I always read, and dynamic jobs that I run for specific topics. In other words, if I want to add a new topic and create a new job for it, I add that job to the pattern matching. In the code below, I want to run a specific job for the Cars and Ships tables, and every other table that I put in the list will run the same replication-table job:

var tables = specifcTables ++ dynamicTables

tables.map(table => {
  // table._1 is the table/topic name; pick the job that should process it
  table._1 match {
    case "CARS"  => new CarsJob
    case "SHIPS" => new ShipsReplicationJob
    case _       => new ReplicationJob
  }
})
After this, I pass this pattern matching to a createJobs function that instantiates each of these jobs, and I pass the result to a startFutureTasks function, which puts each of these jobs in a different thread:

startFutureTasks(createJobs(tables))
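For completeness, here is a rough sketch of what startFutureTasks could look like; the Job trait, the thread-pool sizing, and the blocking wait are my assumptions rather than the exact code from the article:

import java.util.concurrent.Executors
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration.Duration

// Hypothetical trait that CarsJob, ShipsReplicationJob and ReplicationJob would implement.
trait Job { def run(): Unit }

def startFutureTasks(jobs: Seq[Job]): Unit = {
  // One thread per job so each streaming query can block on its own thread.
  implicit val ec: ExecutionContext =
    ExecutionContext.fromExecutorService(Executors.newFixedThreadPool(math.max(jobs.size, 1)))

  val running = jobs.map(job => Future(job.run()))

  // Keep the driver alive until every job terminates.
  running.foreach(f => Await.result(f, Duration.Inf))
}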

I hope this helps. Thanks!
