How to read such a nested multiline JSON file into a data frame with Spark/Scala
I have the following JSON:
{
"value":[
{"C1":"val1","C2":"val2"},
{"C1":"val1","C2":"val2"},
{"C1":"val1","C2":"val2"}
]
}
That I am trying to read like this:
spark.read
.option("multiLine", true).option("mode", "PERMISSIVE")
.json("/Projects.json")
.show(10)
But it is not able to show my records properly in the data frame. How do I get around that "value" nesting so that my rows appear correctly in the DataFrame?
The result I am trying to get is:
C1 | C2
-------------------
VAL1 | VAL2
VAL1 | VAL2
...etc
Looking at the schema of the DataFrame (jsonDf) returned by spark.read:
jsonDf.printSchema()
root
|-- value: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- C1: string (nullable = true)
| | |-- C2: string (nullable = true)
you could use the SQL function explode and then select the two elements C1 and C2 as shown below:
import org.apache.spark.sql.functions.{col, explode}

val df = jsonDf
  .withColumn("parsedJson", explode(col("value")))
  .withColumn("C1", col("parsedJson.C1"))
  .withColumn("C2", col("parsedJson.C2"))
  .select(col("C1"), col("C2"))

df.show(false)
This leads to the required outcome:
+----+----+
|C1 |C2 |
+----+----+
|val1|val2|
|val1|val2|
|val1|val2|
+----+----+
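The same result can be produced without the intermediate withColumn steps by exploding inside a single select. A runnable sketch, building the question's JSON in memory rather than reading /Projects.json:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, explode}

val spark = SparkSession.builder().master("local[*]").appName("explode-sketch").getOrCreate()
import spark.implicits._

// Build the question's JSON in memory so the snippet runs without a file
val raw = """{"value":[{"C1":"val1","C2":"val2"},{"C1":"val1","C2":"val2"},{"C1":"val1","C2":"val2"}]}"""
val jsonDf = spark.read.json(Seq(raw).toDS)

// Explode the array into one row per element, then project the struct fields
val df = jsonDf
  .select(explode(col("value")).as("row"))
  .select(col("row.C1").as("C1"), col("row.C2").as("C2"))

df.show(false)
```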
I finally managed to find a solution to my problem using the following function:
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.{ArrayType, StructType}

def flattenDataframe(df: DataFrame): DataFrame = {
  val fields = df.schema.fields
  val fieldNames = fields.map(_.name)
  for (i <- fields.indices) {
    val field = fields(i)
    val fieldName = field.name
    field.dataType match {
      // Arrays: explode one level and recurse on the result
      case _: ArrayType =>
        val fieldNamesExcludingArray = fieldNames.filter(_ != fieldName)
        val fieldNamesAndExplode = fieldNamesExcludingArray ++
          Array(s"explode_outer($fieldName) as $fieldName")
        val explodedDf = df.selectExpr(fieldNamesAndExplode: _*)
        return flattenDataframe(explodedDf)
      // Structs: promote child fields to the top level, joining names with "_"
      case structType: StructType =>
        val childFieldNames = structType.fieldNames.map(child => s"$fieldName.$child")
        val newFieldNames = fieldNames.filter(_ != fieldName) ++ childFieldNames
        val renamedCols = newFieldNames.map(x => col(x).as(x.replace(".", "_")))
        val flattenedDf = df.select(renamedCols: _*)
        return flattenDataframe(flattenedDf)
      case _ =>
    }
  }
  df
}
Source: https://medium.com/@saikrishna_55717/flattening-nested-data-json-xml-using-apache-spark-75fa4c8ea2a7
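A usage sketch for this helper, with the function body repeated in condensed form so the snippet is self-contained (the input mirrors the question's JSON; the column names in the final comment follow from the underscore-renaming step):

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.{ArrayType, StructType}

// Condensed copy of flattenDataframe so this snippet runs on its own
def flattenDataframe(df: DataFrame): DataFrame = {
  val fields = df.schema.fields
  val fieldNames = fields.map(_.name)
  for (field <- fields) {
    field.dataType match {
      case _: ArrayType =>
        val exprs = fieldNames.filter(_ != field.name) ++
          Array(s"explode_outer(${field.name}) as ${field.name}")
        return flattenDataframe(df.selectExpr(exprs: _*))
      case st: StructType =>
        val children = st.fieldNames.map(c => s"${field.name}.$c")
        val cols = (fieldNames.filter(_ != field.name) ++ children)
          .map(x => col(x).as(x.replace(".", "_")))
        return flattenDataframe(df.select(cols: _*))
      case _ =>
    }
  }
  df
}

val spark = SparkSession.builder().master("local[*]").appName("flatten-sketch").getOrCreate()
import spark.implicits._

val raw = """{"value":[{"C1":"val1","C2":"val2"},{"C1":"val1","C2":"val2"}]}"""
val jsonDf = spark.read.json(Seq(raw).toDS)

val flat = flattenDataframe(jsonDf)
flat.show(false)
// Nested fields surface as value_C1 and value_C2 after flattening
```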
Using inline will do the job:
val df = spark.read
.option("multiLine", true).option("mode", "PERMISSIVE")
.json("/Projects.json")
val df2 = df.selectExpr("inline(value)")
df2.show
+----+----+
| C1| C2|
+----+----+
|val1|val2|
|val1|val2|
|val1|val2|
+----+----+
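The same inline call, sketched as a self-contained snippet with the JSON built in memory. inline is a table-generating function that emits one output row per array element and one output column per struct field, so no explicit explode is needed:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("inline-sketch").getOrCreate()
import spark.implicits._

val raw = """{"value":[{"C1":"val1","C2":"val2"},{"C1":"val1","C2":"val2"},{"C1":"val1","C2":"val2"}]}"""
val df = spark.read.json(Seq(raw).toDS)

// inline(value) replaces explode + struct expansion in a single step
val df2 = df.selectExpr("inline(value)")
df2.show(false)
```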