
Scala schema_of_json function fails in Spark structured streaming

I have created a function to read JSON as a string along with its schema, and I am using that function in Spark structured streaming. I am getting an error while doing so. The same piece of code works when I first create the schema and then use it to read the JSON, but it does not work as a single line. How can I fix it?

def processBatch(microBatchOutputDF: DataFrame, batchId: Long) {
  
  TOPICS.split(',').foreach(topic =>{
    var TableName = topic.split('.').last.toUpperCase
    var df = microBatchOutputDF
    
    /*var schema = schema_of_json(df
                                .select($"value")
                                .filter($"topic".contains(topic))
                                .as[String]
                               )*/
    
    // This is the line that fails: schema_of_json is applied to the per-row
    // "value" column instead of a literal JSON string or a DDL schema string.
    var jsonDataDf = df.filter($"topic".contains(topic))
                      .withColumn("jsonData",
                                  from_json($"value",
                                            schema_of_json(lit($"value".as[String])),
                                            scala.collection.immutable.Map[String, String]().asJava))

    var srcTable = jsonDataDf
                    .select(col(s"jsonData.payload.after.*"), $"offset", $"timestamp")

    srcTable
      .select(srcTable.columns.map(c => col(c).cast(StringType)) : _*)
      .write
      .mode("append").format("delta").save("/mnt/datalake/raw/kafka/" + TableName)
    
    spark.sql(s"""CREATE TABLE IF NOT EXISTS kafka_raw.$TableName USING delta LOCATION '/mnt/datalake/raw/kafka/$TableName'""")
  } )
}

Spark streaming code:

import org.apache.spark.sql.streaming.Trigger

val StreamingQuery = InputDf
                        .select("*")
                        .writeStream.outputMode("update")
                        .option("queryName", "StreamingQuery")
                        .foreachBatch(processBatch _)
                        .start()

Error: org.apache.spark.sql.AnalysisException: Schema should be specified in DDL format as a string literal or output of the schema_of_json/schema_of_csv functions instead of schema_of_json(value)

Error: org.apache.spark.sql.AnalysisException: Schema should be specified in DDL format as a string literal or output of the schema_of_json/schema_of_csv functions instead of schema_of_json(value)

The above error points to an issue with the from_json() function.

Syntax: from_json(jsonStr, schema[, options]) - returns a struct value built from the given jsonStr and schema.

Refer to the examples below:

> SELECT from_json('{"a":1, "b":0.8}', 'a INT, b DOUBLE');
 {"a":1,"b":0.8}
> SELECT from_json('{"time":"26/08/2015"}', 'time Timestamp', map('timestampFormat', 'dd/MM/yyyy'));
 {"time":2015-08-26 00:00:00}

Refer to https://docs.databricks.com/sql/language-manual/functions/from_json.html
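In the Scala DataFrame API the rule is the same: the schema argument of from_json must be either a DDL-formatted string literal or the output of schema_of_json applied to a foldable (literal) JSON string, never to the per-row value column. A minimal sketch, assuming df is the per-topic DataFrame from the question and the sample payload below is hypothetical:

import org.apache.spark.sql.functions.{col, from_json, lit, schema_of_json}

// Option 1: pass the schema as a DDL-formatted string literal
// (this DDL is only illustrative; adjust it to the real payload).
val parsedWithDdl = df.withColumn(
  "jsonData",
  from_json(col("value"), "payload STRUCT<after: STRUCT<id: STRING, name: STRING>>"))

// Option 2: let schema_of_json infer the schema from a *literal* sample
// JSON string, which is foldable, unlike the per-row "value" column.
val sampleJson = """{"payload":{"after":{"id":"1","name":"x"}}}"""   // hypothetical sample
val parsedWithInferred = df.withColumn(
  "jsonData",
  from_json(col("value"), schema_of_json(lit(sampleJson))))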

This is how I solved it. I created a filtered DataFrame from the Kafka output DataFrame and applied all the logic to it, as before. The problem with generating the schema while reading is that from_json does not know which exact row to use out of all the rows of the DataFrame.

def processBatch(microBatchOutputDF: DataFrame, batchId: Long) {
  
  TOPICS.split(',').foreach(topic =>{
    var TableName = topic.split('.').last.toUpperCase
    var df = microBatchOutputDF.where(col("topic") === topic)
    
    // Infer the schema once per topic from the filtered batch, then reuse it
    // for every row parsed by from_json below.
    var schema = schema_of_json(df
                                .select($"value")
                                .filter($"topic".contains(topic))
                                .as[String]
                               )
    
    var jsonDataDf = df.withColumn("jsonData", from_json($"value", schema, scala.collection.immutable.Map[String, String]().asJava))

    var srcTable = jsonDataDf
                    .select(col(s"jsonData.payload.after.*"), $"offset", $"timestamp")

    srcTable
      .select(srcTable.columns.map(c => col(c).cast(StringType)) : _*)
      .write
      .mode("append").format("delta").save("/mnt/datalake/raw/kafka/" + TableName)
    
    spark.sql(s"""CREATE TABLE IF NOT EXISTS kafka_raw.$TableName USING delta LOCATION '/mnt/datalake/raw/kafka/$TableName'""")
  } )
}
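An alternative sketch of the same idea (not part of the original answer): take a single sample value from the filtered micro-batch and let schema_of_json infer the schema from that one literal string, which keeps its argument foldable:

import org.apache.spark.sql.functions.{from_json, lit, schema_of_json}
import spark.implicits._   // for $ and .as[String]

// Sketch: infer the schema from the first JSON value of the filtered batch.
// Assumes df is the per-topic DataFrame and the batch contains at least one
// row for this topic; guard with df.isEmpty before calling head() in real code.
val sampleValue = df.select($"value").as[String].head()
val inferredSchema = schema_of_json(lit(sampleValue))
val jsonDataDf = df.withColumn("jsonData", from_json($"value", inferredSchema))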
