How to read such a nested multiline JSON file into a data frame with Spark/Scala
I have the following JSON:
{
  "value": [
    {"C1": "val1", "C2": "val2"},
    {"C1": "val1", "C2": "val2"},
    {"C1": "val1", "C2": "val2"}
  ]
}
I am trying to read it like this:
spark.read
  .option("multiLine", true).option("mode", "PERMISSIVE")
  .json("/Projects.json")
  .show(10)
but it does not show my records correctly in the dataframe. How can I work around that nested "value" to get my rows into the dataframe properly?
The result I am trying to get is:
C1 | C2
-------------------
VAL1 | VAL2
VAL1 | VAL2
...etc
Looking at the schema of the DataFrame (jsonDf) returned by spark.read:
jsonDf.printSchema()

root
 |-- value: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- C1: string (nullable = true)
 |    |    |-- C2: string (nullable = true)
you can use the SQL function explode and then select the two elements C1 and C2 as shown below:
import org.apache.spark.sql.functions.{col, explode}

// explode turns each element of the "value" array into its own row
val df = jsonDf
  .withColumn("parsedJson", explode(col("value")))
  .withColumn("C1", col("parsedJson.C1"))
  .withColumn("C2", col("parsedJson.C2"))
  .select(col("C1"), col("C2"))

df.show(false)
This produces the desired result:
+----+----+
|C1 |C2 |
+----+----+
|val1|val2|
|val1|val2|
|val1|val2|
+----+----+
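One caveat worth knowing about this approach (not mentioned in the answer above): explode silently drops rows whose array is null or empty, while explode_outer keeps them as a row of nulls. A minimal sketch in local mode, assuming Spark is on the classpath; the object name and data are illustrative:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, explode, explode_outer}

object ExplodeDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[1]").appName("explode-demo").getOrCreate()
    import spark.implicits._

    // One row with a two-element array, one row with an empty array
    val df = Seq(Seq("a", "b"), Seq.empty[String]).toDF("value")

    // explode: the empty-array row disappears -> 2 rows
    println(df.select(explode(col("value"))).count())       // 2

    // explode_outer: the empty-array row survives as null -> 3 rows
    println(df.select(explode_outer(col("value"))).count()) // 3

    spark.stop()
  }
}
```

If the JSON may contain objects with a missing or empty "value" array and you still want a row for them, explode_outer (as used in the flattening function below) is the safer choice.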
I finally managed to solve my problem using the following function:
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.{ArrayType, StructType}

def flattenDataframe(df: DataFrame): DataFrame = {
  val fields = df.schema.fields
  val fieldNames = fields.map(_.name)

  for (i <- fields.indices) {
    val field = fields(i)
    val fieldName = field.name

    field.dataType match {
      // Explode an array column into one row per element, then recurse
      case _: ArrayType =>
        val fieldNamesExcludingArray = fieldNames.filter(_ != fieldName)
        val fieldNamesAndExplode =
          fieldNamesExcludingArray ++ Array(s"explode_outer($fieldName) as $fieldName")
        val explodedDf = df.selectExpr(fieldNamesAndExplode: _*)
        return flattenDataframe(explodedDf)

      // Promote each struct field to a top-level column
      // (dots in the name replaced by underscores), then recurse
      case structType: StructType =>
        val childFieldNames = structType.fieldNames.map(childName => s"$fieldName.$childName")
        val newFieldNames = fieldNames.filter(_ != fieldName) ++ childFieldNames
        val renamedCols = newFieldNames.map(x => col(x).as(x.replace(".", "_")))
        val flattenedDf = df.select(renamedCols: _*)
        return flattenDataframe(flattenedDf)

      case _ => // leave primitive columns as they are
    }
  }
  df
}
Source: https://medium.com/@saikrishna_55717/flattening-nested-data-json-xml-using-apache-spark-75fa4c8ea2a7
Using inline will do the job:
val df = spark.read
  .option("multiLine", true).option("mode", "PERMISSIVE")
  .json("/Projects.json")

val df2 = df.selectExpr("inline(value)")
df2.show
+----+----+
| C1| C2|
+----+----+
|val1|val2|
|val1|val2|
|val1|val2|
+----+----+
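To try inline without first writing a /Projects.json file, the same JSON can be fed in from memory via the Dataset[String] overload of spark.read.json. A sketch, assuming a local SparkSession; the object name and in-memory string stand in for the question's file:

```scala
import org.apache.spark.sql.SparkSession

object InlineDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[1]").appName("inline-demo").getOrCreate()
    import spark.implicits._

    // Same shape as the question's JSON, as a single-line string
    val json = """{"value":[{"C1":"val1","C2":"val2"},{"C1":"val1","C2":"val2"},{"C1":"val1","C2":"val2"}]}"""
    val df = spark.read.json(Seq(json).toDS())

    // inline() expands the array of structs directly into C1/C2 columns
    val df2 = df.selectExpr("inline(value)")
    df2.show()   // three rows, columns C1 and C2

    spark.stop()
  }
}
```

Reading from an in-memory dataset also sidesteps the multiLine option entirely, since the string is already a single JSON record.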