
Parsing JSON in a Spark column

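I have a JSON in a DataFrame column of type String, and I want to convert it into a map. The problem is that I don't know the JSON schema completely, because the key names may vary.

Basically, my JSON column looks like this: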

{"outerkey":{"innerkey_1":[{"uid":"1","price":0.01,"type":"STAT"}],
"innerkey_2":[{"uid":"1","price":4.3,"type":"DYN"}],
"innerkey_3":[{"uid":"1","price":2.0,"type":"DYN"}]}}

I would like it to end up looking like:

{"outerkey":
[{"keyname":"innerkey_1","uid":"1","price":0.01,"type":"STAT"}, 
{"keyname":"innerkey_2","uid":"1","price":4.3,"type":"DYN"}, 
{"keyname":"innerkey_3","uid":"1","price":2.0,"type":"DYN"}]}

so that I can compute the mean of all prices where type is "DYN".

In other words, reading the JSON data with:

val testJsonData = spark.read.json("file:///data/json_example")

gives me the following schema:

root
 |-- outerkey: struct (nullable = true)
 |    |-- innerkey_1: array (nullable = true)
 |    |    |-- element: struct (containsNull = true)
 |    |    |    |-- price: double (nullable = true)
 |    |    |    |-- type: string (nullable = true)
 |    |    |    |-- uid: string (nullable = true)
 |    |-- innerkey_2: array (nullable = true)
 |    |    |-- element: struct (containsNull = true)
 |    |    |    |-- price: double (nullable = true)
 |    |    |    |-- type: string (nullable = true)
 |    |    |    |-- uid: string (nullable = true)
 |    |-- innerkey_3: array (nullable = true)
 |    |    |-- element: struct (containsNull = true)
 |    |    |    |-- price: double (nullable = true)
 |    |    |    |-- type: string (nullable = true)
 |    |    |    |-- uid: string (nullable = true)

However, I would like to end up with something much simpler:

root
 |-- outerkey: array (nullable = false)
 |    |-- element: struct (containsNull = false)
 |    |    |-- keyname: string (nullable = false)
 |    |    |-- price: double (nullable = true)
 |    |    |-- type: string (nullable = true)
 |    |    |-- uid: string (nullable = true)

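What transformations can I apply to the data to end up with the schema above?

Please let me know the simplest way to do this. Thanks in advance!

Your requirement is somewhat complex. If my edit to the question is correct, the following solution should work.

You already have the input dataframe with this schema: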

root
 |-- outerkey: struct (nullable = true)
 |    |-- innerkey_1: array (nullable = true)
 |    |    |-- element: struct (containsNull = true)
 |    |    |    |-- price: double (nullable = true)
 |    |    |    |-- type: string (nullable = true)
 |    |    |    |-- uid: string (nullable = true)
 |    |-- innerkey_2: array (nullable = true)
 |    |    |-- element: struct (containsNull = true)
 |    |    |    |-- price: double (nullable = true)
 |    |    |    |-- type: string (nullable = true)
 |    |    |    |-- uid: string (nullable = true)
 |    |-- innerkey_3: array (nullable = true)
 |    |    |-- element: struct (containsNull = true)
 |    |    |    |-- price: double (nullable = true)
 |    |    |    |-- type: string (nullable = true)
 |    |    |    |-- uid: string (nullable = true)

The next step is to change the dataframe so that each array is in its own column:

val tempT = testJsonData.select($"outerkey.*")

The schema will be:

root
 |-- innerkey_1: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- price: double (nullable = true)
 |    |    |-- type: string (nullable = true)
 |    |    |-- uid: string (nullable = true)
 |-- innerkey_2: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- price: double (nullable = true)
 |    |    |-- type: string (nullable = true)
 |    |    |-- uid: string (nullable = true)
 |-- innerkey_3: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- price: double (nullable = true)
 |    |    |-- type: string (nullable = true)
 |    |    |-- uid: string (nullable = true)

Since you want the key name in each struct, you need to get the column names:

val schema = tempT.schema.fieldNames

So schema will hold innerkey_1, innerkey_2, innerkey_3.

The complex part is adding the key name to each struct column, which takes two for loops:

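import org.apache.spark.sql.functions._
import spark.implicits._   // enables the $"..." syntax (already in scope in spark-shell)

// tt is reassigned in the loops below, so start from a mutable reference to tempT
var tt = tempT
// Replace each array column with its elements, one struct per row.
// Note: exploding several columns in turn gives a cross product if an array holds more than one element.
for (column <- schema) {
  tt = tt.withColumn(column, explode($"${column}"))
}

// Prepend the column name as a "keyname" field to each struct
for (column <- schema) {
  tt = tt.withColumn(column, struct(lit(column).as("keyname"), $"${column}.*"))
}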

Finally, tt will have the key name added in each struct column, as follows:

root
 |-- innerkey_1: struct (nullable = false)
 |    |-- keyname: string (nullable = false)
 |    |-- price: double (nullable = true)
 |    |-- type: string (nullable = true)
 |    |-- uid: string (nullable = true)
 |-- innerkey_2: struct (nullable = false)
 |    |-- keyname: string (nullable = false)
 |    |-- price: double (nullable = true)
 |    |-- type: string (nullable = true)
 |    |-- uid: string (nullable = true)
 |-- innerkey_3: struct (nullable = false)
 |    |-- keyname: string (nullable = false)
 |    |-- price: double (nullable = true)
 |    |-- type: string (nullable = true)
 |    |-- uid: string (nullable = true)

The last step is to combine all the elements into one column, which is the reverse of the first step:

val temp = tt.select(array(schema.map(col): _*).as("outerkey"))

The schema of temp will be your desired schema:

root
 |-- outerkey: array (nullable = false)
 |    |-- element: struct (containsNull = false)
 |    |    |-- keyname: string (nullable = false)
 |    |    |-- price: double (nullable = true)
 |    |    |-- type: string (nullable = true)
 |    |    |-- uid: string (nullable = true)
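
With the data in this shape, the goal stated in the question (the mean price where type is "DYN") becomes a short query. Below is a minimal end-to-end sketch of the steps so far plus that aggregation. It assumes a local SparkSession and feeds the sample record from the question in as an in-memory string instead of reading a file; the names exploded, tagged and meanDyn are mine, and the two loops are written as foldLeft calls only to avoid a mutable variable.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

// Local session for illustration; in spark-shell this returns the existing one
val spark = SparkSession.builder().master("local[*]").appName("mean-dyn-sketch").getOrCreate()
import spark.implicits._

// Sample record from the question, read from memory rather than from a file
val json = """{"outerkey":{"innerkey_1":[{"uid":"1","price":0.01,"type":"STAT"}],"innerkey_2":[{"uid":"1","price":4.3,"type":"DYN"}],"innerkey_3":[{"uid":"1","price":2.0,"type":"DYN"}]}}"""
val testJsonData = spark.read.json(Seq(json).toDS())

// One column per inner key, then the column names
val tempT = testJsonData.select($"outerkey.*")
val schema = tempT.schema.fieldNames

// Same two passes as the for loops: explode each array, then tag each struct with its key name
val exploded = schema.foldLeft(tempT)((df, c) => df.withColumn(c, explode(col(c))))
val tagged = schema.foldLeft(exploded)((df, c) =>
  df.withColumn(c, struct(lit(c).as("keyname"), col(s"$c.*"))))

// Merge the tagged structs back into a single array column
val temp = tagged.select(array(schema.map(col): _*).as("outerkey"))

// Mean price over the elements whose type is "DYN"
val meanDyn = temp
  .select(explode($"outerkey").as("item"))
  .where($"item.type" === "DYN")
  .agg(avg($"item.price").as("mean_price"))
  .first()
  .getDouble(0)

println(meanDyn)

For the sample record this averages the two DYN prices, 4.3 and 2.0.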

temp.toJSON.foreach(x => println(x.toString)) should give you the JSON data you want:

{"outerkey":[{"keyname":"innerkey_1","price":0.01,"type":"STAT","uid":"1"},{"keyname":"innerkey_2","price":4.3,"type":"DYN","uid":"1"},{"keyname":"innerkey_3","price":2.0,"type":"DYN","uid":"1"}]}
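
As a possible alternative when the JSON really sits in a String column and the inner key names are not known up front, the column can be parsed with from_json using a map type for outerkey, so the key names never have to appear in the schema. This is only a sketch under assumptions: the column name json, the names item, jsonSchema and flat, and the fixed element fields (uid, price, type) are mine, taken from the sample in the question. It yields a flat table with one row per array element rather than the nested array above.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._

val spark = SparkSession.builder().master("local[*]").appName("json-map-sketch").getOrCreate()
import spark.implicits._

// Hypothetical input: a single String column named "json" holding the record from the question
val raw = Seq(
  """{"outerkey":{"innerkey_1":[{"uid":"1","price":0.01,"type":"STAT"}],"innerkey_2":[{"uid":"1","price":4.3,"type":"DYN"}],"innerkey_3":[{"uid":"1","price":2.0,"type":"DYN"}]}}"""
).toDF("json")

// Only the element fields are fixed; the inner key names stay open because outerkey is a map
val item = new StructType()
  .add("uid", StringType)
  .add("price", DoubleType)
  .add("type", StringType)
val jsonSchema = new StructType()
  .add("outerkey", MapType(StringType, ArrayType(item)))

val parsed = raw.select(from_json($"json", jsonSchema).as("data"))

// explode on a map column produces one row per entry, in columns named "key" and "value"
val flat = parsed
  .select(explode($"data.outerkey"))
  .select($"key".as("keyname"), explode($"value").as("item"))
  .select($"keyname", $"item.*")

flat.show(false)

// Mean price for type "DYN" on the flat table
val meanDyn = flat.where($"type" === "DYN").agg(avg($"price")).first().getDouble(0)

Because each array is exploded on its own row set here, arrays with more than one element do not multiply the rows of the other keys.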

