[英]Merge Dataframes With Differents Schemas - Scala Spark
我正在将 JSON 转换为数据框。 在第一步中,我创建了一个数据框数组,然后创建了一个联合。 但是我在 JSON 中使用不同的模式进行联合时遇到了问题。
如果 JSON 具有与您在其他问题中看到的相同的架构,我可以做到这一点: Parse JSON root in a column using Spark-Scala
我正在处理以下数据:
val exampleJsonDifferentSchema = spark.createDataset(
"""
{"ITEM1512":
{"name":"Yin",
"address":{"city":"Columbus",
"state":"Ohio"},
"age":28 },
"ITEM1518":
{"name":"Yang",
"address":{"city":"Working",
"state":"Marc"}
},
"ITEM1458":
{"name":"Yossup",
"address":{"city":"Macoss",
"state":"Microsoft"},
"age":28
}
}""" :: Nil)
如您所见,不同之处在于一个数据框没有年龄。
val itemsExampleDiff = spark.read.json(exampleJsonDifferentSchema)
itemsExampleDiff.show(false)
itemsExampleDiff.printSchema
+---------------------------------+---------------------------+-----------------------+
|ITEM1458 |ITEM1512 |ITEM1518 |
+---------------------------------+---------------------------+-----------------------+
|[[Macoss, Microsoft], 28, Yossup]|[[Columbus, Ohio], 28, Yin]|[[Working, Marc], Yang]|
+---------------------------------+---------------------------+-----------------------+
root
|-- ITEM1458: struct (nullable = true)
| |-- address: struct (nullable = true)
| | |-- city: string (nullable = true)
| | |-- state: string (nullable = true)
| |-- age: long (nullable = true)
| |-- name: string (nullable = true)
|-- ITEM1512: struct (nullable = true)
| |-- address: struct (nullable = true)
| | |-- city: string (nullable = true)
| | |-- state: string (nullable = true)
| |-- age: long (nullable = true)
| |-- name: string (nullable = true)
|-- ITEM1518: struct (nullable = true)
| |-- address: struct (nullable = true)
| | |-- city: string (nullable = true)
| | |-- state: string (nullable = true)
| |-- name: string (nullable = true)
我现在的解决方案是如下代码,其中我制作了一个 DataFrame 的数组:
val columns:Array[String] = itemsExample.columns
var arrayOfExampleDFs:Array[DataFrame] = Array()
for(col_name <- columns){
val temp = itemsExample.select(lit(col_name).as("Item"), col(col_name).as("Value"))
arrayOfExampleDFs = arrayOfExampleDFs :+ temp
}
val jsonDF = arrayOfExampleDFs.reduce(_ union _)
但是当我在联合中减少时,我有一个 JSON 具有不同的架构,我不能这样做,因为数据框架需要具有相同的架构。 事实上,我有以下错误:
org.apache.spark.sql.AnalysisException: Union can only be performed on tables with the compatible column types 。
我正在尝试做一些我在这个问题中发现的类似事情: How to perform union on two DataFrames with different amount of columns in spark?
具体那部分:
val cols1 = df1.columns.toSet
val cols2 = df2.columns.toSet
val total = cols1 ++ cols2 // union
def expr(myCols: Set[String], allCols: Set[String]) = {
allCols.toList.map(x => x match {
case x if myCols.contains(x) => col(x)
case _ => lit(null).as(x)
})
}
但是我不能为列设置集合,因为我需要动态地捕获总计和单列的列。 我只能做这样的事情:
for(i <- 0 until arrayOfExampleDFs.length-1) {
val cols1 = arrayOfExampleDFs(i).select("Value").columns.toSet
val cols2 = arrayOfExampleDFs(i+1).select("Value").columns.toSet
val total = cols1 ++ cols2
arrayOfExampleDFs(i).select("Value").printSchema()
print(total)
}
那么,一个 function 怎么可能动态地执行这个联合呢?
更新:预计 output
在这种情况下,此数据框和模式:
+--------+---------------------------------+
|Item |Value |
+--------+---------------------------------+
|ITEM1458|[[Macoss, Microsoft], 28, Yossup]|
|ITEM1512|[[Columbus, Ohio], 28, Yin] |
|ITEM1518|[[Working, Marc], null, Yang] |
+--------+---------------------------------+
root
|-- Item: string (nullable = false)
|-- Value: struct (nullable = true)
| |-- address: struct (nullable = true)
| | |-- city: string (nullable = true)
| | |-- state: string (nullable = true)
| |-- age: long (nullable = true)
| |-- name: string (nullable = true)
这是一种可能的解决方案,它通过在找不到时添加年龄列来为所有数据框创建一个通用模式:
import org.apache.spark.sql.functions.{col, lit, struct}
import org.apache.spark.sql.types.{LongType, StructField, StructType}
....
for(col_name <- columns){
val currentDf = itemsExampleDiff.select(col(col_name))
// try to identify if age field is present
val hasAge = currentDf.schema.fields(0)
.dataType
.asInstanceOf[StructType]
.fields
.contains(StructField("age", LongType, true))
val valueCol = hasAge match {
// if not construct a new value column
case false => struct(
col(s"${col_name}.address"),
lit(null).cast("bigint").as("age"),
col(s"${col_name}.name")
)
case true => col(col_name)
}
arrayOfExampleDFs = arrayOfExampleDFs :+ currentDf.select(lit(col_name).as("Item"), valueCol.as("Value"))
}
val jsonDF = arrayOfExampleDFs.reduce(_ union _)
// +--------+---------------------------------+
// |Item |Value |
// +--------+---------------------------------+
// |ITEM1458|[[Macoss, Microsoft], 28, Yossup]|
// |ITEM1512|[[Columbus, Ohio], 28, Yin] |
// |ITEM1518|[[Working, Marc],, Yang] |
// +--------+---------------------------------+
分析:可能最苛刻的部分是找出age
是否存在。 对于查找,我们使用df.schema.fields
属性,它允许我们深入了解每一列的内部模式。
当找不到年龄时,我们使用struct
重新生成列:
struct(
col(s"${col_name}.address"),
lit(null).cast("bigint").as("age"),
col(s"${col_name}.name")
)
声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.