
Parsing Json on Event Hub messages using spark streaming

I am trying to parse a json file streamed through EventHub. I cast the message body to a string and then use from_json as shown below. I can save the whole json object as a single cell in a delta table (this happens when I write the stream from df4 in the code below), but when I use body.* or col(body.*) to split the json into multiple columns I get an error. Any suggestions on how to handle this?

// Scala Code //
val incomingStream = spark.readStream.format("eventhubs").options(customEventhubParameters.toMap).load()

incomingStream.printSchema()

val outputStream = incomingStream.select($"body".cast(StringType)).alias("body")
                                
val df = outputStream.toDF()
val df4=df.select(from_json(col("body"),jsonSchema))
val df5=df4.select("body.*")

df5.writeStream
  .format("delta")
  .outputMode("append")
  .option("ignoreChanges", "true")
  .option("checkpointLocation", "/mnt/abc/checkpoints/samplestream")
  .start("/mnt/abc/samplestream")

Output

root
 |-- body: binary (nullable = true)
 |-- partition: string (nullable = true)
 |-- offset: string (nullable = true)
 |-- sequenceNumber: long (nullable = true)
 |-- enqueuedTime: timestamp (nullable = true)
 |-- publisher: string (nullable = true)
 |-- partitionKey: string (nullable = true)
 |-- properties: map (nullable = true)
 |    |-- key: string
 |    |-- value: string (valueContainsNull = true)
 |-- systemProperties: map (nullable = true)
 |    |-- key: string
 |    |-- value: string (valueContainsNull = true)

root
 |-- body: string (nullable = true)

AnalysisException: cannot resolve 'body.*' given input columns 'body'
    at org.apache.spark.sql.catalyst.analysis.UnresolvedStarBase.expand(unresolved.scala:416)
    at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$.$anonfun$expand$1(Analyzer.scala:2507)
    at org.apache.spark.sql.catalyst.analysis.package$.withPosition(package.scala:53)
    at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$.org$apache$spark$sql$catalyst$analysis$Analyzer$ResolveReferences$$expand(Analyzer.scala:2506)
    at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$.$anonfun$buildExpandedProjectList$1(Analyzer.scala:2526)
    at scala.collection.TraversableLike.$anonfun$flatMap$1(TraversableLike.scala:245)
    at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
    at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
    at scala.collection.TraversableLike.flatMap(TraversableLike.scala:245)
    at scala.collection.TraversableLike.flatMap$(TraversableLike.scala:242)
    at scala.collection.AbstractTraversable.flatMap(Traversable.scala:108)
    at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$.buildExpandedProjectList(Analyzer.scala:2524)
    at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$18.applyOrElse(Analyzer.scala:2238)
    at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$18.applyOrElse(Analyzer.scala:2233)
    at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.$anonfun$resolveOperatorsUpWithPruning$3(AnalysisHelper.scala:137)
    at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:86)
    at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.$anonfun$resolveOperatorsUpWithPruning$1(AnalysisHelper.scala:137)
    at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$.allowInvokingTransformsInAnalyzer(AnalysisHelper.scala:340)
    at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.resolveOperatorsUpWithPruning(AnalysisHelper.scala:133)

The link below shows a way of displaying the output on the console, and that works for me; what I am trying to do is write the json to a delta file with multiple columns.

[https://stackoverflow.com/questions/57298849/parsing-event-hub-messages-using-spark-streaming]

The problem with your code seems to be how you are using alias, so that the column body is no longer available. A few observations about your use of alias in the code, and what you could try in order to fix this:

Observation 1

Problem:

val outputStream = incomingStream.select($"body".cast(StringType)).alias("body")

Your code above aliases the entire dataframe. If your intention was to make sure the body column is aliased as body after the string cast, you can try the following

Suggestion:

val outputStream = incomingStream.select($"body".cast(StringType).alias("body"))
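As a quick sanity check (my own sketch, not part of the original answer), printing the schema after the column-level alias should match the second printSchema output already shown in the question, a single string column named body:

// Sketch: verify the column is still named "body" after the cast + alias
outputStream.printSchema()

// Expected output (as in the question):
// root
//  |-- body: string (nullable = true)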

Observation 2

Where you have

Problem:

val df4=df.select(from_json(col("body"),jsonSchema))

You should use an alias here so that you can access the column later, since it is now referenced by a different name (you can see this for yourself while debugging with printSchema or show).

Suggestion:

val df4=df.select(from_json(col("body"),jsonSchema).alias("body"))
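Putting the two suggestions together, a minimal sketch of the full pipeline could look like the following. This assumes customEventhubParameters and jsonSchema are defined as in the question, and keeps the question's own mount paths; it is an illustration of the aliasing fix, not a drop-in replacement.

import org.apache.spark.sql.functions.{col, from_json}
import org.apache.spark.sql.types.StringType

val incomingStream = spark.readStream
  .format("eventhubs")
  .options(customEventhubParameters.toMap)
  .load()

// Cast the binary body to a string and alias the column (not the DataFrame)
val jsonBody = incomingStream.select(col("body").cast(StringType).alias("body"))

// Parse the JSON string into a struct column, again aliased as "body"
val parsed = jsonBody.select(from_json(col("body"), jsonSchema).alias("body"))

// Expand the struct into one top-level column per JSON field
val exploded = parsed.select("body.*")

exploded.writeStream
  .format("delta")
  .outputMode("append")
  .option("checkpointLocation", "/mnt/abc/checkpoints/samplestream")
  .start("/mnt/abc/samplestream")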
