
Scala schema_of_json function fails in Spark structured streaming

I have created a function to read JSON as a string along with its schema, and I am using that function in Spark structured streaming. I am getting an error while doing so. The same piece of code works when I first create the schema and then use it to read the JSON, but it does not work as a single line. How can I fix it?

def processBatch(microBatchOutputDF: DataFrame, batchId: Long) {
  
  TOPICS.split(',').foreach(topic =>{
    var TableName = topic.split('.').last.toUpperCase
    var df = microBatchOutputDF
    
    /*var schema = schema_of_json(df
                                .select($"value")
                                .filter($"topic".contains(topic))
                                .as[String]
                               )*/
    
    // This is the line that fails: schema_of_json is applied to the per-row
    // "value" column instead of a literal JSON string or a DDL schema string.
    var jsonDataDf = df.filter($"topic".contains(topic))
                      .withColumn("jsonData",
                                  from_json($"value",
                                            schema_of_json(lit($"value".as[String])),
                                            scala.collection.immutable.Map[String, String]().asJava))

    var srcTable = jsonDataDf
                    .select(col(s"jsonData.payload.after.*"), $"offset", $"timestamp")

    srcTable
      .select(srcTable.columns.map(c => col(c).cast(StringType)) : _*)
      .write
      .mode("append").format("delta").save("/mnt/datalake/raw/kafka/" + TableName)
    
    spark.sql(s"""CREATE TABLE IF NOT EXISTS kafka_raw.$TableName USING delta LOCATION '/mnt/datalake/raw/kafka/$TableName'""")
  } )
}

Spark streaming code:

import org.apache.spark.sql.streaming.Trigger

val StreamingQuery = InputDf
                        .select("*")
                        .writeStream.outputMode("update")
                        .option("queryName", "StreamingQuery")
                        .foreachBatch(processBatch _)
                        .start()

Error: org.apache.spark.sql.AnalysisException: Schema should be specified in DDL format as a string literal or output of the schema_of_json/schema_of_csv functions instead of schema_of_json(value)

Error: org.apache.spark.sql.AnalysisException: Schema should be specified in DDL format as a string literal or output of the schema_of_json/schema_of_csv functions instead of schema_of_json(value)

The above error points to an issue with the from_json() function.

Syntax: from_json(jsonStr, schema[, options]) - returns a struct value built from the given jsonStr and schema.

Refer to the examples below:

> SELECT from_json('{"a":1, "b":0.8}', 'a INT, b DOUBLE');
 {"a":1,"b":0.8}
> SELECT from_json('{"time":"26/08/2015"}', 'time Timestamp', map('timestampFormat', 'dd/MM/yyyy'));
 {"time":2015-08-26 00:00:00}

Refer to https://docs.databricks.com/sql/language-manual/functions/from_json.html
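In the Scala DataFrame API the rule is the same: the schema argument of from_json must be either a DDL-formatted string literal or the output of schema_of_json applied to a foldable (literal) JSON string, never to the per-row value column. A minimal sketch, assuming df is the per-topic DataFrame from the question and the sample payload below is hypothetical:

import org.apache.spark.sql.functions.{col, from_json, lit, schema_of_json}

// Option 1: pass the schema as a DDL-formatted string literal
// (this DDL is only illustrative; adjust it to the real payload).
val parsedWithDdl = df.withColumn(
  "jsonData",
  from_json(col("value"), "payload STRUCT<after: STRUCT<id: STRING, name: STRING>>"))

// Option 2: let schema_of_json infer the schema from a *literal* sample
// JSON string, which is foldable, unlike the per-row "value" column.
val sampleJson = """{"payload":{"after":{"id":"1","name":"x"}}}"""   // hypothetical sample
val parsedWithInferred = df.withColumn(
  "jsonData",
  from_json(col("value"), schema_of_json(lit(sampleJson))))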

This is how I solved it. I created a filtered DataFrame from the Kafka output DataFrame and applied all the logic to it, as before. The problem with generating the schema while reading is that from_json does not know which exact row to use out of all the rows of the DataFrame.

def processBatch(microBatchOutputDF: DataFrame, batchId: Long) {
  
  TOPICS.split(',').foreach(topic =>{
    var TableName = topic.split('.').last.toUpperCase
    var df = microBatchOutputDF.where(col("topic") === topic)
    
    // Infer the schema once per topic from the filtered batch, then reuse it
    // for every row parsed by from_json below.
    var schema = schema_of_json(df
                                .select($"value")
                                .filter($"topic".contains(topic))
                                .as[String]
                               )
    
    var jsonDataDf = df.withColumn("jsonData", from_json($"value", schema, scala.collection.immutable.Map[String, String]().asJava))

    var srcTable = jsonDataDf
                    .select(col(s"jsonData.payload.after.*"), $"offset", $"timestamp")

    srcTable
      .select(srcTable.columns.map(c => col(c).cast(StringType)) : _*)
      .write
      .mode("append").format("delta").save("/mnt/datalake/raw/kafka/" + TableName)
    
    spark.sql(s"""CREATE TABLE IF NOT EXISTS kafka_raw.$TableName USING delta LOCATION '/mnt/datalake/raw/kafka/$TableName'""")
  } )
}
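An alternative sketch of the same idea (not part of the original answer): take a single sample value from the filtered micro-batch and let schema_of_json infer the schema from that one literal string, which keeps its argument foldable:

import org.apache.spark.sql.functions.{from_json, lit, schema_of_json}
import spark.implicits._   // for $ and .as[String]

// Sketch: infer the schema from the first JSON value of the filtered batch.
// Assumes df is the per-topic DataFrame and the batch contains at least one
// row for this topic; guard with df.isEmpty before calling head() in real code.
val sampleValue = df.select($"value").as[String].head()
val inferredSchema = schema_of_json(lit(sampleValue))
val jsonDataDf = df.withColumn("jsonData", from_json($"value", inferredSchema))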
