Parse & flatten JSON object in a text file using Spark & Scala into Dataframe
Parse JSON file using Spark Scala
I have a JSON source data file as shown below, and I need the "expected result" in a completely different format, which is also shown below. Is there a way to achieve this using Spark Scala? Thanks for your help.
JSON source data file:
{
  "APP": [
    {
      "E": 1566799999225,
      "V": 44.0
    },
    {
      "E": 1566800002758,
      "V": 61.0
    }
  ],
  "ASP": [
    {
      "E": 1566800009446,
      "V": 23.399999618530273
    }
  ],
  "TT": 0,
  "TVD": [
    {
      "E": 1566799964040,
      "V": 50876515
    }
  ],
  "VIN": "FU74HZ501740XXXXX"
}
Expected result:
JSON schema:
|-- APP: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- E: long (nullable = true)
| | |-- V: double (nullable = true)
|-- ASP: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- E: long (nullable = true)
| | |-- V: double (nullable = true)
|-- ATO: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- E: long (nullable = true)
| | |-- V: double (nullable = true)
|-- MSG_TYPE: string (nullable = true)
|-- RPM: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- E: long (nullable = true)
| | |-- V: double (nullable = true)
|-- TT: long (nullable = true)
|-- TVD: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- E: long (nullable = true)
| | |-- V: long (nullable = true)
|-- VIN: string (nullable = true)
You can start by reading the JSON file:
val inputDataFrame: DataFrame = sparkSession
  .read
  .option("multiline", true)
  .json(yourJsonPath)
Then you can create a simple rule to pick out APP, ASP, ATO, since they are the only fields in the input with an array-of-struct data type:
import org.apache.spark.sql.types.{ArrayType, StructField}

val inputDataFrameFields: Array[StructField] = inputDataFrame.schema.fields
var snColumn = new Array[String](inputDataFrame.schema.length)
for (x <- 0 to (inputDataFrame.schema.length - 1)) {
  if (inputDataFrameFields(x).dataType.isInstanceOf[ArrayType] && !inputDataFrameFields(x).name.isEmpty) {
    snColumn(x) = inputDataFrameFields(x).name
  }
}
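The loop above leaves null slots in snColumn for the non-array fields. As a side note, the same schema inspection can be written more concisely; this is a sketch using the same Spark schema API, with `arrayColumnNames` as a hypothetical name:

```scala
import org.apache.spark.sql.types.ArrayType

// Equivalent sketch: keep only the names of the array-typed top-level
// columns (APP, ASP, TVD for the sample input), with no null placeholders.
val arrayColumnNames: Array[String] = inputDataFrame.schema.fields
  .collect { case f if f.dataType.isInstanceOf[ArrayType] => f.name }
```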
Then create an empty dataframe as follows and populate it:
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._

val outputSchema = StructType(
  List(
    StructField("VIN", StringType, true),
    StructField(
      "EVENTS",
      ArrayType(
        StructType(Array(
          StructField("SN", StringType, true),
          StructField("E", LongType, true), // E holds epoch milliseconds, which overflow IntegerType
          StructField("V", DoubleType, true)
        )))),
    StructField("TT", StringType, true)
  )
)
val outputDataFrame = sparkSession.createDataFrame(sparkSession.sparkContext.emptyRDD[Row], outputSchema)
Then you need to create some UDFs to parse your input and do the correct mapping.
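As an alternative to a UDF, a minimal sketch of that mapping can be written with Spark's built-in higher-order function `transform` (Spark 2.4+), which tags every array element with its source column name (SN) before concatenating everything into a single EVENTS array. The names `arrayCols`, `tagged`, and `eventsDataFrame` are hypothetical, not from the original answer:

```scala
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.ArrayType

// Names of the array-typed top-level columns (APP, ASP, TVD in the sample).
val arrayCols = inputDataFrame.schema.fields
  .filter(_.dataType.isInstanceOf[ArrayType])
  .map(_.name)

// For each array column, tag its elements with the column name as SN and
// cast V to double so all arrays share one element type.
val tagged = arrayCols.map { c =>
  expr(s"transform($c, x -> named_struct('SN', '$c', 'E', x.E, 'V', CAST(x.V AS DOUBLE)))")
}

// concat on array columns (Spark 2.4+) merges them into one EVENTS array.
val eventsDataFrame = inputDataFrame.select(
  col("VIN"),
  concat(tagged: _*).as("EVENTS"),
  col("TT").cast("string").as("TT")
)
```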
Hope this helps.
Here is a solution that parses the JSON into a Spark dataframe that fits your data:
val input = "{\"APP\":[{\"E\":1566799999225,\"V\":44.0},{\"E\":1566800002758,\"V\":61.0}],\"ASP\":[{\"E\":1566800009446,\"V\":23.399999618530273}],\"TT\":0,\"TVD\":[{\"E\":1566799964040,\"V\":50876515}],\"VIN\":\"FU74HZ501740XXXXX\"}"
import sparkSession.implicits._
import org.apache.spark.sql.functions.{col, explode}

val outputDataFrame = sparkSession.read
  .option("multiline", true).option("mode", "PERMISSIVE")
  .json(Seq(input).toDS)
  .withColumn("APP", explode(col("APP")))
  .withColumn("ASP", explode(col("ASP")))
  .withColumn("TVD", explode(col("TVD")))
  .select(
    col("VIN"), col("TT"),
    col("APP").getItem("E").as("APP_E"),
    col("APP").getItem("V").as("APP_V"),
    col("ASP").getItem("E").as("ASP_E"),
    col("ASP").getItem("V").as("ASP_V"),
    col("TVD").getItem("E").as("TVD_E"),
    col("TVD").getItem("V").as("TVD_V")
  )
outputDataFrame.show(truncate = false)
/*
+-----------------+---+-------------+-----+-------------+------------------+-------------+--------+
|VIN              |TT |APP_E        |APP_V|ASP_E        |ASP_V             |TVD_E        |TVD_V   |
+-----------------+---+-------------+-----+-------------+------------------+-------------+--------+
|FU74HZ501740XXXXX|0  |1566799999225|44.0 |1566800009446|23.399999618530273|1566799964040|50876515|
|FU74HZ501740XXXXX|0  |1566800002758|61.0 |1566800009446|23.399999618530273|1566799964040|50876515|
+-----------------+---+-------------+-----+-------------+------------------+-------------+--------+
*/
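Two caveats worth noting with this approach (my observations, not part of the original answer): chaining several explodes multiplies rows, so the result has |APP| × |ASP| × |TVD| rows (2 × 1 × 1 = 2 here), and explode drops a record entirely when one of its arrays is null or empty. If records may be missing, say, ASP, a sketch of the same pipeline with explode_outer preserves them by emitting a single null element instead:

```scala
import org.apache.spark.sql.functions.{col, explode_outer}

// Same pipeline, but explode_outer keeps records whose arrays are null or
// empty, producing one row with a null element for that column.
val safeDataFrame = sparkSession.read
  .option("multiline", true)
  .json(Seq(input).toDS)
  .withColumn("APP", explode_outer(col("APP")))
  .withColumn("ASP", explode_outer(col("ASP")))
  .withColumn("TVD", explode_outer(col("TVD")))
```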