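Parsing JSON in a Spark column
I have JSON in a DataFrame column of type String, and I want to convert it to a map. The problem is that I do not know the JSON schema exactly, because the key names may vary.
Basically, my JSON column looks like this: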
{"outerkey":{"innerkey_1":[{"uid":"1","price":0.01,"type":"STAT"}],
"innerkey_2":[{"uid":"1","price":4.3,"type":"DYN"}],
"innerkey_3":[{"uid":"1","price":2.0,"type":"DYN"}]}}
I would like it to end up looking like this:
{"outerkey":
[{"keyname":"innerkey_1","uid":"1","price":0.01,"type":"STAT"},
{"keyname":"innerkey_2","uid":"1","price":4.3,"type":"DYN"},
{"keyname":"innerkey_3","uid":"1","price":2.0,"type":"DYN"}]}
so that I can compute the mean of all prices where the type is "DYN".
In other words, reading the JSON data with:
val testJsonData = spark.read.json("file:///data/json_example")
gives me the following schema:
root
|-- outerkey: struct (nullable = true)
| |-- innerkey_1: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- price: double (nullable = true)
| | | |-- type: string (nullable = true)
| | | |-- uid: string (nullable = true)
| |-- innerkey_2: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- price: double (nullable = true)
| | | |-- type: string (nullable = true)
| | | |-- uid: string (nullable = true)
| |-- innerkey_3: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- price: double (nullable = true)
| | | |-- type: string (nullable = true)
| | | |-- uid: string (nullable = true)
However, I would like to end up with something much simpler:
root
|-- outerkey: array (nullable = false)
| |-- element: struct (containsNull = false)
| | |-- keyname: string (nullable = false)
| | |-- price: double (nullable = true)
| | |-- type: string (nullable = true)
| | |-- uid: string (nullable = true)
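Which transformations can I apply to the data to end up with the schema above?
Please let me know the simplest way. Thanks in advance!
Your requirement is a bit complex. If my edit to the question is correct, the following solution should work.
You already have the schema of the input dataframe as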
root
|-- outerkey: struct (nullable = true)
| |-- innerkey_1: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- price: double (nullable = true)
| | | |-- type: string (nullable = true)
| | | |-- uid: string (nullable = true)
| |-- innerkey_2: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- price: double (nullable = true)
| | | |-- type: string (nullable = true)
| | | |-- uid: string (nullable = true)
| |-- innerkey_3: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- price: double (nullable = true)
| | | |-- type: string (nullable = true)
| | | |-- uid: string (nullable = true)
The next step is to change the dataframe so that each array sits in its own column:
val tempT = testJsonData.select($"outerkey.*")
Its schema will be
root
|-- innerkey_1: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- price: double (nullable = true)
| | |-- type: string (nullable = true)
| | |-- uid: string (nullable = true)
|-- innerkey_2: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- price: double (nullable = true)
| | |-- type: string (nullable = true)
| | |-- uid: string (nullable = true)
|-- innerkey_3: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- price: double (nullable = true)
| | |-- type: string (nullable = true)
| | |-- uid: string (nullable = true)
Since you want the key name inside each struct, you need to get the column names:
val schema = tempT.schema.fieldNames
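So schema will hold innerkey_1, innerkey_2, innerkey_3.
The tricky part is adding the key name to each struct column, which takes two for loops:
import org.apache.spark.sql.functions._
// Start from the widened dataframe; tt is reassigned inside the loops
var tt = tempT
// First loop: replace each array column with one row per array element
for (column <- schema) {
  tt = tt.withColumn(column, explode($"${column}"))
}
// Second loop: rebuild each struct with the column name added as "keyname"
for (column <- schema) {
  tt = tt.withColumn(column, struct(lit(column).as("keyname"), $"${column}.*"))
}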
Finally, tt will have the key name added to each struct column, as shown below:
root
|-- innerkey_1: struct (nullable = false)
| |-- keyname: string (nullable = false)
| |-- price: double (nullable = true)
| |-- type: string (nullable = true)
| |-- uid: string (nullable = true)
|-- innerkey_2: struct (nullable = false)
| |-- keyname: string (nullable = false)
| |-- price: double (nullable = true)
| |-- type: string (nullable = true)
| |-- uid: string (nullable = true)
|-- innerkey_3: struct (nullable = false)
| |-- keyname: string (nullable = false)
| |-- price: double (nullable = true)
| |-- type: string (nullable = true)
| |-- uid: string (nullable = true)
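Note that the two loops above explode every column in turn, so if any array held more than one element, the rows would become the cross product of the elements across columns. The sample data has exactly one element per array, so this does not show up here. The following is only a sketch of one possible alternative, assuming Spark 2.4 or later (for the SQL `transform` function and `flatten`) and assuming the element fields are always `price`, `type` and `uid` as in the question. The sample record, the second element under `innerkey_1`, and the names `raw`, `wide`, `tagged` and `merged` are made up for illustration.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().master("local[1]").appName("tag-by-key-sketch").getOrCreate()
import spark.implicits._

// Hypothetical sample record: innerkey_1 deliberately holds two elements
val raw = Seq(
  """{"outerkey":{"innerkey_1":[{"uid":"1","price":0.01,"type":"STAT"},{"uid":"2","price":1.5,"type":"DYN"}],"innerkey_2":[{"uid":"1","price":4.3,"type":"DYN"}]}}"""
).toDS()

// Same first step as in the answer: one array column per inner key
val wide = spark.read.json(raw).select($"outerkey.*")

// For each key, rewrite its array so that every element carries the key name.
// The element fields (price, type, uid) are hardcoded from the question's schema.
val tagged = wide.schema.fieldNames.map { key =>
  expr(s"transform(`$key`, x -> named_struct('keyname', '$key', 'price', x.price, 'type', x.`type`, 'uid', x.uid))")
}

// Concatenate the per-key arrays into a single array column, without exploding
val merged = wide.select(flatten(array(tagged: _*)).as("outerkey"))
merged.printSchema()
```

This sketch does not handle keys whose array is null in some rows, so it would need extra care when the key names really do differ from record to record.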
The last step is to merge all the elements into a single column, which is the reverse of the first step:
val temp = tt.select(array(schema.map(col): _*).as("outerkey"))
The schema of temp will be the schema you want:
root
|-- outerkey: array (nullable = false)
| |-- element: struct (containsNull = false)
| | |-- keyname: string (nullable = false)
| | |-- price: double (nullable = true)
| | |-- type: string (nullable = true)
| | |-- uid: string (nullable = true)
And temp.toJSON.foreach(x => println(x.toString)) should give you the JSON data you want:
{"outerkey":[{"keyname":"innerkey_1","price":0.01,"type":"STAT","uid":"1"},{"keyname":"innerkey_2","price":4.3,"type":"DYN","uid":"1"},{"keyname":"innerkey_3","price":2.0,"type":"DYN","uid":"1"}]}
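Since the stated goal was the mean price for type "DYN", here is a minimal sketch of how that could be computed from the flattened structure. It rebuilds the one-row result above from a JSON string so that it runs on its own. The local SparkSession settings and the names `flatJson`, `flat`, `items` and `dynMean` are assumptions for illustration, not part of the original answer.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().master("local[1]").appName("mean-price-sketch").getOrCreate()
import spark.implicits._

// The flattened record from the output above, wrapped in a one-row Dataset[String]
val flatJson = Seq(
  """{"outerkey":[{"keyname":"innerkey_1","price":0.01,"type":"STAT","uid":"1"},{"keyname":"innerkey_2","price":4.3,"type":"DYN","uid":"1"},{"keyname":"innerkey_3","price":2.0,"type":"DYN","uid":"1"}]}"""
).toDS()
val flat = spark.read.json(flatJson)

// One row per array element, with the struct fields promoted to top-level columns
val items = flat.select(explode($"outerkey").as("item")).select($"item.*")

// Mean price over the rows whose type is "DYN"
val dynMean = items.filter($"type" === "DYN").agg(avg($"price")).first().getDouble(0)
println(dynMean)
```

With the sample data this averages the two DYN prices, 4.3 and 2.0. In the answer's own pipeline the same select, filter and agg calls could be applied to temp directly instead of re-reading the JSON.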