Transform the schema of a Spark data-frame in Scala
Input data-frame:
{
"C_1" : "A",
"C_2" : "B",
"C_3" : [
{
"ID" : "ID1",
"C3_C2" : "V1",
"C3_C3" : "V2"
},
{
"ID" : "ID2",
"C3_C2" : "V3",
"C3_C3" : "V4"
},
{
"ID" : "ID3",
"C3_C2" : "V4",
"C3_C3" : "V5"
},
..
]
}
Desired Output:
{
"C_1" : "A",
"C_2" : "B",
"ID1" : {
"C3_C2" : "V1",
"C3_C3" : "V2"
},
"ID2" : {
"C3_C2" : "V3",
"C3_C3" : "V4"
},
"ID3" : {
"C3_C2" : "V4",
"C3_C3" : "V5"
},
..
}
C_3 is an array of n structs, with each item having a unique ID. The new data-frame is expected to convert the n structs in C_3 into separate columns and name the columns as per the value of ID.
I am new to Spark & Scala. Any thoughts on how to achieve this transformation would be very helpful.
Thanks!
You can explode the structs, then pivot by ID:
import org.apache.spark.sql.functions.{first, struct}

val df2 = df.selectExpr("C_1", "C_2", "inline(C_3)")
  .groupBy("C_1", "C_2")
  .pivot("ID")
  .agg(first(struct("C3_C2", "C3_C3")))
df2.show
+---+---+--------+--------+--------+
|C_1|C_2| ID1| ID2| ID3|
+---+---+--------+--------+--------+
| A| B|[V1, V2]|[V3, V4]|[V4, V5]|
+---+---+--------+--------+--------+
df2.printSchema
root
|-- C_1: string (nullable = true)
|-- C_2: string (nullable = true)
|-- ID1: struct (nullable = true)
| |-- C3_C2: string (nullable = true)
| |-- C3_C3: string (nullable = true)
|-- ID2: struct (nullable = true)
| |-- C3_C2: string (nullable = true)
| |-- C3_C3: string (nullable = true)
|-- ID3: struct (nullable = true)
| |-- C3_C2: string (nullable = true)
| |-- C3_C3: string (nullable = true)
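For anyone wanting to try this end-to-end, here is a self-contained sketch of the same approach, assuming a spark-shell session (so `spark` and its implicits are in scope); the sample JSON simply mirrors the question's input record:

```scala
import org.apache.spark.sql.functions.{first, struct}
import spark.implicits._

// Rebuild the question's sample record as a one-row DataFrame
val df = spark.read.json(Seq(
  """{"C_1":"A","C_2":"B","C_3":[{"ID":"ID1","C3_C2":"V1","C3_C3":"V2"},{"ID":"ID2","C3_C2":"V3","C3_C3":"V4"},{"ID":"ID3","C3_C2":"V4","C3_C3":"V5"}]}"""
).toDS)

// inline() explodes the array into one row per struct;
// pivot("ID") then turns each distinct ID value into its own column
val df2 = df.selectExpr("C_1", "C_2", "inline(C_3)")
  .groupBy("C_1", "C_2")
  .pivot("ID")
  .agg(first(struct("C3_C2", "C3_C3")))

df2.show()
```

Note that `pivot` without an explicit value list triggers an extra job to collect the distinct IDs; if you already know them, pass them as `pivot("ID", knownIds)`.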
[posting my hacky solution for reference].
@mck's answer is probably the neat way to do it, but it wasn't sufficient for my use-case. My data-frame had a lot of columns, and using all of them in a group-by followed by a pivot was an expensive operation.
For my use-case, the IDs in C_3 were unique and known values, so that is the assumption made in this solution. I was able to achieve the transformation as follows:
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.udf

case class C3_Interm_Struct(C3_C2: String, C3_C3: String)
case class C3_Out_Struct(ID1: C3_Interm_Struct, ID2: C3_Interm_Struct) // Manually add fields as required

val custom_flat_func = udf((a: Seq[Row]) => {
  var _id1: C3_Interm_Struct = null
  var _id2: C3_Interm_Struct = null
  for (item <- a) {
    // Access fields by name so the code does not depend on the struct's field order
    val intermData = C3_Interm_Struct(item.getAs[String]("C3_C2"), item.getAs[String]("C3_C3"))
    item.getAs[String]("ID") match {
      case "ID1" => _id1 = intermData
      case "ID2" => _id2 = intermData
      // case ... => manual expansion for the remaining known IDs
      case _ =>
    }
  }
  Seq(C3_Out_Struct(_id1, _id2)) // Return type has to be Seq so inline() can expand it
})

val flatDf = df.withColumn("C_3", custom_flat_func($"C_3"))
  .selectExpr("C_1", "C_2", "inline(C_3)") // expand the Seq, which has only 1 Row
flatDf.first.prettyJson
Output:
{
"C_1" : "A",
"C_2" : "B",
"ID1" : {
"C3_C2" : "V1",
"C3_C3" : "V2"
},
"ID2" : {
"C3_C2" : "V3",
"C3_C3" : "V4"
}
}
udfs are generally slow, but this was much faster than pivot with group-by.
[There may be more efficient solutions possible; I am not aware of them at the time of writing this]
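One such alternative, sketched here under the same assumption of known IDs (a sketch, not something I have benchmarked; requires Spark 2.4+): the SQL higher-order functions `filter` and `transform` can pull each ID's struct straight out of the array, avoiding both the group-by/pivot shuffle and a UDF. `knownIds` below is the assumed list of IDs:

```scala
import org.apache.spark.sql.functions.expr

val knownIds = Seq("ID1", "ID2", "ID3")

// For each known ID: keep the matching struct, drop its ID field,
// and take the single element [0] out of the filtered array
val flatDf2 = knownIds.foldLeft(df) { (acc, id) =>
  acc.withColumn(id,
    expr(s"transform(filter(C_3, x -> x.ID = '$id'), x -> struct(x.C3_C2, x.C3_C3))[0]"))
}.drop("C_3")

flatDf2.first.prettyJson
```

Since this is a pure projection (no shuffle), it should stay cheap even on wide data-frames with many other columns.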