
Spark Structured Streaming - AssertionError in Checkpoint due to increasing the number of input sources

I am trying to join two streams into one and write the result to a topic.

Code:

1- Reading two topics

val PERSONINFORMATION_df: DataFrame = spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "xx:9092")
    .option("subscribe", "PERSONINFORMATION")
    .option("group.id", "info")
    .option("maxOffsetsPerTrigger", 1000)
    .option("startingOffsets", "earliest")
    .load()


val CANDIDATEINFORMATION_df: DataFrame = spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "xxx:9092")
    .option("subscribe", "CANDIDATEINFORMATION")
    .option("group.id", "candent")
    .option("startingOffsets", "earliest")
    .option("maxOffsetsPerTrigger", 1000)
    .option("failOnDataLoss", "false")
    .load()

2- Parse data to join them:

val parsed_PERSONINFORMATION_df: DataFrame = PERSONINFORMATION_df
    .select(from_json(expr("cast(value as string) as actualValue"), schemaPERSONINFORMATION).as("s"))
    .select("s.*")

val parsed_CANDIDATEINFORMATION_df: DataFrame = CANDIDATEINFORMATION_df
    .select(from_json(expr("cast(value as string) as actualValue"), schemaCANDIDATEINFORMATION).as("s"))
    .select("s.*")

val df_person = parsed_PERSONINFORMATION_df.as("dfperson")
val df_candidate = parsed_CANDIDATEINFORMATION_df.as("dfcandidate")
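For context, the question does not show the two schema values. A minimal sketch of what they might look like, with field names inferred from the join and select expressions used later (ID, PERSONID, FULLNAME, PERSONALID) and all types assumed to be strings, so adjust to your actual JSON payloads:

```scala
// Hypothetical schema definitions -- not shown in the question.
// Field names are guessed from the join/select expressions; types assumed.
import org.apache.spark.sql.types._

val schemaPERSONINFORMATION: StructType = new StructType()
  .add("ID", StringType)
  .add("FULLNAME", StringType)

val schemaCANDIDATEINFORMATION: StructType = new StructType()
  .add("ID", StringType)
  .add("PERSONID", StringType)
  .add("PERSONALID", StringType)
```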

3- Join the two frames

val joined_df: DataFrame = df_candidate
    .join(df_person, col("dfcandidate.PERSONID") === col("dfperson.ID"), "inner")

val string2json: DataFrame = joined_df.select(
    $"dfcandidate.ID".as("key"),
    to_json(struct($"dfcandidate.ID", $"FULLNAME", $"PERSONALID")).cast("String").as("value"))

4- Write them to a topic

  string2json.writeStream.format("kafka")
      .option("kafka.bootstrap.servers", "xxxx:9092")
      .option("topic", "toDelete")
      .option("checkpointLocation", "checkpoints")
      .option("failOnDataLoss", "false")
      .start()
      .awaitTermination()

Error message:

    21/01/25 11:01:41 ERROR streaming.MicroBatchExecution: Query [id = 9ce8bcf2-0299-42d5-9b5e-534af8d689e3, runId = 0c0919c6-f49e-48ae-a635-2e95e31fdd50] terminated with error
java.lang.AssertionError: assertion failed: There are [1] sources in the checkpoint offsets and now there are [2] sources requested by the query. Cannot continue.
       

Your code looks fine to me; it is rather the checkpointing that is causing the issue.

Based on the error message, you probably ran this job with only one stream source at first. Then you added the code for the stream join and tried to re-start the application without removing the existing checkpoint files. Now the application tries to recover from the checkpoint files, but realises that you initially had only one source whereas the query now has two.

The section Recovery Semantics after Changes in a Streaming Query in the Spark documentation explains which changes are and are not allowed when using checkpointing. Changing the number of input sources is not allowed:

"Changes in the number or type (i.e. different source) of input sources: This is not allowed."

To solve your problem: delete the current checkpoint files and re-start the job.
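Since the `checkpointLocation` option in the question is the relative path "checkpoints", the cleanup could look like the sketch below. Stop the running query first, and adjust the path if your deployment writes checkpoints elsewhere (e.g. to HDFS or S3):

```shell
# Remove the stale checkpoint directory so the restarted query builds
# fresh state for both sources.
# WARNING: this discards the stored offsets; with startingOffsets=earliest
# the job will reprocess both topics from the beginning.
rm -rf checkpoints/
```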
