Parsing JSON in a Spark column

Question

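I have JSON in a dataframe column of type String, and I want to convert it to a map. The problem is that I don't fully know the schema of the JSON, because the key names can vary.

Basically, my JSON column looks like this: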

{"outerkey":{"innerkey_1":[{"uid":"1","price":0.01,"type":"STAT"}],
"innerkey_2":[{"uid":"1","price":4.3,"type":"DYN"}],
"innerkey_3":[{"uid":"1","price":2.0,"type":"DYN"}]}}

I'd like it to end up looking like this:

{"outerkey":
[{"keyname":"innerkey_1","uid":"1","price":0.01,"type":"STAT"}, 
{"keyname":"innerkey_2","uid":"1","price":4.3,"type":"DYN"}, 
{"keyname":"innerkey_3","uid":"1","price":2.0,"type":"DYN"}]}

so that I can compute the mean of all prices whose type is "DYN".

In other words, reading the JSON data with:

val testJsonData = spark.read.json("file:///data/json_example")

gives me the following schema:

root
 |-- outerkey: struct (nullable = true)
 |    |-- innerkey_1: array (nullable = true)
 |    |    |-- element: struct (containsNull = true)
 |    |    |    |-- price: double (nullable = true)
 |    |    |    |-- type: string (nullable = true)
 |    |    |    |-- uid: string (nullable = true)
 |    |-- innerkey_2: array (nullable = true)
 |    |    |-- element: struct (containsNull = true)
 |    |    |    |-- price: double (nullable = true)
 |    |    |    |-- type: string (nullable = true)
 |    |    |    |-- uid: string (nullable = true)
 |    |-- innerkey_3: array (nullable = true)
 |    |    |-- element: struct (containsNull = true)
 |    |    |    |-- price: double (nullable = true)
 |    |    |    |-- type: string (nullable = true)
 |    |    |    |-- uid: string (nullable = true)

However, I'd like to end up with something much simpler:

root
 |-- outerkey: array (nullable = false)
 |    |-- element: struct (containsNull = false)
 |    |    |-- keyname: string (nullable = false)
 |    |    |-- price: double (nullable = true)
 |    |    |-- type: string (nullable = true)
 |    |    |-- uid: string (nullable = true)

Which transformations can I apply to the data to end up with the schema above?

Please let me know the simplest way. Thanks in advance!

Answer 1

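Your requirement is a bit complex. If my edit to the question is correct, the following solution should work.

You already have the input dataframe with this schema: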

root
 |-- outerkey: struct (nullable = true)
 |    |-- innerkey_1: array (nullable = true)
 |    |    |-- element: struct (containsNull = true)
 |    |    |    |-- price: double (nullable = true)
 |    |    |    |-- type: string (nullable = true)
 |    |    |    |-- uid: string (nullable = true)
 |    |-- innerkey_2: array (nullable = true)
 |    |    |-- element: struct (containsNull = true)
 |    |    |    |-- price: double (nullable = true)
 |    |    |    |-- type: string (nullable = true)
 |    |    |    |-- uid: string (nullable = true)
 |    |-- innerkey_3: array (nullable = true)
 |    |    |-- element: struct (containsNull = true)
 |    |    |    |-- price: double (nullable = true)
 |    |    |    |-- type: string (nullable = true)
 |    |    |    |-- uid: string (nullable = true)

The next step is to change the dataframe so that each array sits in its own column:

val tempT = testJsonData.select($"outerkey.*")

Its schema will be:

root
 |-- innerkey_1: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- price: double (nullable = true)
 |    |    |-- type: string (nullable = true)
 |    |    |-- uid: string (nullable = true)
 |-- innerkey_2: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- price: double (nullable = true)
 |    |    |-- type: string (nullable = true)
 |    |    |-- uid: string (nullable = true)
 |-- innerkey_3: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- price: double (nullable = true)
 |    |    |-- type: string (nullable = true)
 |    |    |-- uid: string (nullable = true)

Since you want the key name inside each struct, you need to get the field names:

val schema = tempT.schema.fieldNames

So schema holds the field names innerkey_1, innerkey_2, innerkey_3.

The complicated part is adding the key name to each struct column, which takes two for loops:

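import org.apache.spark.sql.functions._

// Start from tempT; a var is needed because each loop reassigns it
var tt = tempT

// Unwrap each array column into its element struct
for (column <- schema) {
  tt = tt.withColumn(column, explode($"${column}"))
}

// Prepend the column name as a "keyname" field in each struct
for (column <- schema) {
  tt = tt.withColumn(column, struct(lit(column).as("keyname"), $"${column}.*"))
}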

In the end, tt has the key name added to each struct column, as follows:

root
 |-- innerkey_1: struct (nullable = false)
 |    |-- keyname: string (nullable = false)
 |    |-- price: double (nullable = true)
 |    |-- type: string (nullable = true)
 |    |-- uid: string (nullable = true)
 |-- innerkey_2: struct (nullable = false)
 |    |-- keyname: string (nullable = false)
 |    |-- price: double (nullable = true)
 |    |-- type: string (nullable = true)
 |    |-- uid: string (nullable = true)
 |-- innerkey_3: struct (nullable = false)
 |    |-- keyname: string (nullable = false)
 |    |-- price: double (nullable = true)
 |    |-- type: string (nullable = true)
 |    |-- uid: string (nullable = true)
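
The same two passes can also be written without a mutable variable by folding over the field names. This is only a sketch: the object and helper names (AddKeyNames, addKeyNames) and the local SparkSession are assumptions for illustration, and it assumes the field names contain no dots or other characters that would need backtick quoting.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, explode, lit, struct}

object AddKeyNames {
  // Hypothetical helper: the same two passes as the for loops, expressed as folds.
  def addKeyNames(tempT: DataFrame): DataFrame = {
    val names = tempT.schema.fieldNames
    // Pass 1: unwrap each array column into its element struct.
    val exploded = names.foldLeft(tempT) { (df, name) =>
      df.withColumn(name, explode(col(name)))
    }
    // Pass 2: rebuild each struct with the column name as a leading "keyname" field.
    names.foldLeft(exploded) { (df, name) =>
      df.withColumn(name, struct(lit(name).as("keyname"), col(s"$name.*")))
    }
  }

  def main(args: Array[String]): Unit = {
    // Assumption: a local session, used only to make the sketch runnable.
    val spark = SparkSession.builder().master("local[*]").appName("add-keynames").getOrCreate()
    import spark.implicits._

    // One record with the shape shown in the question.
    val json =
      """{"outerkey":{"innerkey_1":[{"uid":"1","price":0.01,"type":"STAT"}],"innerkey_2":[{"uid":"1","price":4.3,"type":"DYN"}],"innerkey_3":[{"uid":"1","price":2.0,"type":"DYN"}]}}"""
    val testJsonData = spark.read.json(Seq(json).toDS())

    addKeyNames(testJsonData.select($"outerkey.*")).printSchema()
    spark.stop()
  }
}
```

Folding keeps every intermediate dataframe immutable; the result should match what the loops produce.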

The last step is to merge all the struct columns into a single array column, which reverses the first step:

val temp = tt.select(array(schema.map(col): _*).as("outerkey"))

The schema of temp will be the schema you want:

root
 |-- outerkey: array (nullable = false)
 |    |-- element: struct (containsNull = false)
 |    |    |-- keyname: string (nullable = false)
 |    |    |-- price: double (nullable = true)
 |    |    |-- type: string (nullable = true)
 |    |    |-- uid: string (nullable = true)

and temp.toJSON.foreach(x => println(x.toString)) should give you the JSON data you want:

{"outerkey":[{"keyname":"innerkey_1","price":0.01,"type":"STAT","uid":"1"},{"keyname":"innerkey_2","price":4.3,"type":"DYN","uid":"1"},{"keyname":"innerkey_3","price":2.0,"type":"DYN","uid":"1"}]}
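
If the JSON really lives in a String column and the inner key names are unknown, one alternative is to parse outerkey as a map instead of a struct, so the key names never need to be listed. The sketch below is not part of the answer above. It assumes a hypothetical column name json_col, assumes every item has the uid/price/type fields shown in the question, and uses a local SparkSession for illustration. It also avoids a side effect of exploding several array columns one after another, which multiplies rows whenever an array holds more than one element.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{avg, col, explode, from_json}
import org.apache.spark.sql.types._

object MeanDynPrice {
  // Assumed item layout, taken from the sample record in the question.
  val item: StructType = StructType(Seq(
    StructField("uid", StringType),
    StructField("price", DoubleType),
    StructField("type", StringType)))

  // outerkey is read as a map, so its key names do not have to be known up front.
  val schema: StructType = StructType(Seq(
    StructField("outerkey", MapType(StringType, ArrayType(item)))))

  // Hypothetical helper: one output row per item, with the map key as "keyname".
  def flatten(df: DataFrame): DataFrame =
    df.select(from_json(col("json_col"), schema).as("parsed"))
      .select(explode(col("parsed.outerkey")).as(Seq("keyname", "items")))
      .select(col("keyname"), explode(col("items")).as("item"))
      .select(col("keyname"), col("item.*"))

  // Mean price over the rows whose type is "DYN".
  def meanDynPrice(df: DataFrame): DataFrame =
    flatten(df).filter(col("type") === "DYN").agg(avg(col("price")).as("mean_price"))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("mean-dyn-price").getOrCreate()
    import spark.implicits._

    val json =
      """{"outerkey":{"innerkey_1":[{"uid":"1","price":0.01,"type":"STAT"}],"innerkey_2":[{"uid":"1","price":4.3,"type":"DYN"}],"innerkey_3":[{"uid":"1","price":2.0,"type":"DYN"}]}}"""
    val df = Seq(json).toDF("json_col")

    flatten(df).show(false)
    meanDynPrice(df).show()
    spark.stop()
  }
}
```

For the sample record, the DYN prices are 4.3 and 2.0, so the mean should come out at about 3.15. The same filter and avg can be applied to temp from the answer after exploding its outerkey array.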
