Scala schema_of_json function fails in Spark Structured Streaming
I have created a function that reads JSON strings together with their inferred schema, and I am using that function in Spark Structured Streaming. Doing so raises an error. The same logic works when I create the schema first and then use that schema to read the JSON, but it does not work as a single expression. How can I fix it?
def processBatch(microBatchOutputDF: DataFrame, batchId: Long) {
  TOPICS.split(',').foreach(topic => {
    var TableName = topic.split('.').last.toUpperCase
    var df = microBatchOutputDF
    /* var schema = schema_of_json(df
      .select($"value")
      .filter($"topic".contains(topic))
      .as[String]
    ) */
    // The from_json call below is where the AnalysisException is raised:
    // schema_of_json is applied to the per-row column $"value".
    var jsonDataDf = df.filter($"topic".contains(topic))
      .withColumn("jsonData", from_json($"value", schema_of_json(lit($"value".as[String])), scala.collection.immutable.Map[String, String]().asJava))
    var srcTable = jsonDataDf
      .select(col(s"jsonData.payload.after.*"), $"offset", $"timestamp")
    srcTable
      .select(srcTable.columns.map(c => col(c).cast(StringType)): _*)
      .write
      .mode("append").format("delta").save("/mnt/datalake/raw/kafka/" + TableName)
    spark.sql(s"""CREATE TABLE IF NOT EXISTS kafka_raw.$TableName USING delta LOCATION '/mnt/datalake/raw/kafka/$TableName'""")
  })
}
Spark Structured Streaming code:
import org.apache.spark.sql.streaming.Trigger
val StreamingQuery = InputDf
.select("*")
.writeStream.outputMode("update")
.option("queryName", "StreamingQuery")
.foreachBatch(processBatch _)
.start()
Error: org.apache.spark.sql.AnalysisException: Schema should be specified in DDL format as a string literal or output of the schema_of_json/schema_of_csv functions instead of schema_of_json(value)
The above error points to the from_json() call. The schema argument must be foldable at analysis time: either a DDL-format string literal, or the output of schema_of_json/schema_of_csv applied to a literal. It cannot reference a per-row column such as $"value", because Spark needs one fixed schema for the whole column before execution starts.
Syntax: from_json(jsonStr, schema[, options]) - returns a struct value parsed from the given jsonStr using the given schema.
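A minimal sketch of the constraint (assuming a local SparkSession and Spark 2.4+; the object and column names are illustrative), contrasting the failing per-row form with a working literal form:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{from_json, lit, schema_of_json}

object SchemaOfJsonDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[1]").appName("demo").getOrCreate()
    import spark.implicits._

    val df = Seq("""{"a":1,"b":0.8}""").toDF("value")

    // Fails at analysis time with the AnalysisException above:
    // schema_of_json($"value") references a per-row column, which the
    // optimizer cannot fold into one constant schema.
    // df.withColumn("jsonData", from_json($"value", schema_of_json($"value")))

    // Works: schema_of_json over a string literal is foldable.
    df.withColumn("jsonData", from_json($"value", schema_of_json(lit("""{"a":1,"b":0.8}"""))))
      .show(false)

    spark.stop()
  }
}
```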
Refer to the examples below:
> SELECT from_json('{"a":1, "b":0.8}', 'a INT, b DOUBLE');
{"a":1,"b":0.8}
> SELECT from_json('{"time":"26/08/2015"}', 'time Timestamp', map('timestampFormat', 'dd/MM/yyyy'));
{"time":2015-08-26 00:00:00}
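The same call from the Scala DataFrame API, passing the schema as a DDL string (a sketch; the String-schema overload of from_json also takes an options map):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.from_json

object FromJsonDdlDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[1]").appName("demo").getOrCreate()
    import spark.implicits._

    // DDL-format schema as a string literal, mirroring the SQL example above.
    Seq("""{"a":1, "b":0.8}""").toDF("value")
      .select(from_json($"value", "a INT, b DOUBLE", Map.empty[String, String]).as("parsed"))
      .select("parsed.a", "parsed.b")
      .show()

    spark.stop()
  }
}
```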
Refer - https://docs.databricks.com/sql/language-manual/functions/from_json.html
This is how I solved it. I created a filtered dataframe from the Kafka output dataframe and applied all the logic to it, as before. The problem with generating the schema while reading is that from_json does not know which row of the dataframe to use, so the schema has to be inferred up front from a sample value.
def processBatch(microBatchOutputDF: DataFrame, batchId: Long) {
  TOPICS.split(',').foreach(topic => {
    var TableName = topic.split('.').last.toUpperCase
    var df = microBatchOutputDF.where(col("topic") === topic)
    // Infer the schema from a single sample value up front; schema_of_json
    // takes a string (or literal column), not a whole Dataset, hence first().
    var schema = schema_of_json(df
      .select($"value")
      .as[String]
      .first()
    )
    var jsonDataDf = df.withColumn("jsonData", from_json($"value", schema, scala.collection.immutable.Map[String, String]().asJava))
    var srcTable = jsonDataDf
      .select(col(s"jsonData.payload.after.*"), $"offset", $"timestamp")
    srcTable
      .select(srcTable.columns.map(c => col(c).cast(StringType)): _*)
      .write
      .mode("append").format("delta").save("/mnt/datalake/raw/kafka/" + TableName)
    spark.sql(s"""CREATE TABLE IF NOT EXISTS kafka_raw.$TableName USING delta LOCATION '/mnt/datalake/raw/kafka/$TableName'""")
  })
}
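One caveat with sampling a single value: schema_of_json will miss fields that appear only in other rows of the batch. A hypothetical alternative (not part of the original answer) is to let spark.read.json infer the schema over every value in the batch, as in this sketch:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.from_json

object InferBatchSchemaDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[1]").appName("demo").getOrCreate()
    import spark.implicits._

    // Stand-in for the filtered micro-batch: "b" only appears in the second row.
    val df = Seq("""{"a":1}""", """{"a":2,"b":"x"}""").toDF("value")

    // DataFrameReader.json(Dataset[String]) scans all rows, so the inferred
    // schema is the union of fields across the whole batch.
    val inferred = spark.read.json(df.select($"value").as[String]).schema

    df.withColumn("jsonData", from_json($"value", inferred)).show(false)

    spark.stop()
  }
}
```

This costs an extra pass over the batch, which may be acceptable inside foreachBatch since each micro-batch is bounded.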