Transform the schema of a Spark data-frame in Scala

Input data-frame:

{ 
  "C_1" : "A",
  "C_2" : "B",
  "C_3" : [ 
            {
              "ID" : "ID1",
              "C3_C2" : "V1",
              "C3_C3" : "V2"
            },
            {
              "ID" : "ID2",
              "C3_C2" : "V3",
              "C3_C3" : "V4"
            },
            {
              "ID" : "ID3",
              "C3_C2" : "V4",
              "C3_C3" : "V5"
            },
            ..
          ]
}

Desired output:

{ 
  "C_1" : "A",
  "C_2" : "B",
  "ID1" : {
              "C3_C2" : "V1",
              "C3_C3" : "V2"
          },
  "ID2" : {
              "C3_C2" : "V3",
              "C3_C3" : "V4"
          },
  "ID3" : {
              "C3_C2" : "V4",
              "C3_C3" : "V5"
          },
  ..
}

C_3 is an array of n structs, each with a unique ID. The new data-frame should turn the n structs in C_3 into separate columns, each named after the value of its ID.

I am new to Spark and Scala. Any thoughts on how to achieve this transformation would be very helpful.

Thanks!

You can explode the array of structs into rows with inline, then pivot by ID:

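import org.apache.spark.sql.functions.{first, struct}

// inline(C_3) yields one row per struct, with columns ID, C3_C2 and C3_C3
val df2 = df.selectExpr("C_1","C_2","inline(C_3)")
            .groupBy("C_1","C_2")
            .pivot("ID")
            .agg(first(struct("C3_C2","C3_C3")))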

df2.show
+---+---+--------+--------+--------+
|C_1|C_2|     ID1|     ID2|     ID3|
+---+---+--------+--------+--------+
|  A|  B|[V1, V2]|[V3, V4]|[V4, V5]|
+---+---+--------+--------+--------+

df2.printSchema
root
 |-- C_1: string (nullable = true)
 |-- C_2: string (nullable = true)
 |-- ID1: struct (nullable = true)
 |    |-- C3_C2: string (nullable = true)
 |    |-- C3_C3: string (nullable = true)
 |-- ID2: struct (nullable = true)
 |    |-- C3_C2: string (nullable = true)
 |    |-- C3_C3: string (nullable = true)
 |-- ID3: struct (nullable = true)
 |    |-- C3_C2: string (nullable = true)
 |    |-- C3_C3: string (nullable = true)
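
If the set of IDs is known up front, a possible refinement is to pass the values to pivot explicitly. Without them, Spark first has to compute the distinct values of the pivot column itself. The sketch below assumes a hand-maintained list; knownIds, pivotByIds, Item and Record are hypothetical names used only for this illustration.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{first, struct}

// Hypothetical case classes mirroring the input shown in the question
case class Item(ID: String, C3_C2: String, C3_C3: String)
case class Record(C_1: String, C_2: String, C_3: Seq[Item])

object PivotWithKnownIds {
  // Same pivot as above, but with the pivot values supplied by the caller
  // (assumption: knownIds lists every ID you want as a column)
  def pivotByIds(df: DataFrame, knownIds: Seq[String]): DataFrame =
    df.selectExpr("C_1", "C_2", "inline(C_3)")
      .groupBy("C_1", "C_2")
      .pivot("ID", knownIds)
      .agg(first(struct("C3_C2", "C3_C3")))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("pivot-known-ids")
      .getOrCreate()
    import spark.implicits._

    val df = Seq(
      Record("A", "B", Seq(
        Item("ID1", "V1", "V2"),
        Item("ID2", "V3", "V4"),
        Item("ID3", "V4", "V5")))
    ).toDF()

    pivotByIds(df, Seq("ID1", "ID2", "ID3")).printSchema()
    spark.stop()
  }
}
```

IDs that are missing from the list are left out of the result, and listed IDs that never occur come back as null columns.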

[Posting my hacky solution for reference.]
@mck's answer is probably the neat way to do it, but it wasn't sufficient for my use-case. My data-frame had a lot of columns, and using all of them in the group-by before the pivot was an expensive operation.
For my use-case, the IDs in C_3 were unique and known values, so this solution assumes exactly that. I was able to achieve the transformation as follows:

import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.udf

case class C3_Interm_Struct(C3_C2: String, C3_C3: String)
case class C3_Out_Struct(ID1: C3_Interm_Struct, ID2: C3_Interm_Struct) // Manually add fields as required

val custom_flat_func = udf((a: Seq[Row]) => {
    var _id1: C3_Interm_Struct = null
    var _id2: C3_Interm_Struct = null
    for (item <- a) {
        // Fields are read by position: 0 = ID, 1 = C3_C2, 2 = C3_C3
        val intermData = C3_Interm_Struct(item(1).toString, item(2).toString)

        if (item(0).equals("ID1")) {
            _id1 = intermData
        }
        else if (item(0).equals("ID2")) {
            _id2 = intermData
        }
        // Manual expansion: add one more else-if branch per additional ID
        // ..
    }

    Seq(C3_Out_Struct(_id1, _id2)) // Return type has to be Seq so that inline() can expand it
})

val flatDf = df.withColumn("C_3", custom_flat_func($"C_3")).selectExpr("C_1", "C_2", "inline(C_3)") //Expand the Seq which has only 1 Row
flatDf.first.prettyJson

Output:

{ 
  "C_1" : "A",
  "C_2" : "B",
  "ID1" : {
              "C3_C2" : "V1",
              "C3_C3" : "V2"
          },
  "ID2" : {
              "C3_C2" : "V3",
              "C3_C3" : "V4"
          }
}

UDFs are generally slow, but in my case this was much faster than the pivot with group-by.

[There may be more efficient solutions; I am not aware of them at the time of writing.]
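
Under the same assumption (unique, known IDs), one possible variant avoids both the UDF and the group-by by using Spark SQL's higher-order functions filter and transform, which are available from Spark 2.4. This is only a sketch: flattenByIds, knownIds, C3Item and InputRow are hypothetical names, and the IDs are interpolated into a SQL string, so they must not contain quotes.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, expr}

// Hypothetical case classes mirroring the input shown in the question
case class C3Item(ID: String, C3_C2: String, C3_C3: String)
case class InputRow(C_1: String, C_2: String, C_3: Seq[C3Item])

object FlattenWithoutUdf {
  // Builds one struct column per known ID; every column except C_3 passes through untouched
  def flattenByIds(df: DataFrame, knownIds: Seq[String]): DataFrame = {
    val passThrough = df.columns.filter(_ != "C_3").map(c => col(c)).toSeq
    val perId = knownIds.map { id =>
      // Keep only the element with this ID, drop the ID field, take the first match
      expr(
        s"""transform(
           |  filter(C_3, x -> x.ID = '$id'),
           |  x -> named_struct('C3_C2', x.C3_C2, 'C3_C3', x.C3_C3)
           |)[0]""".stripMargin
      ).as(id)
    }
    df.select(passThrough ++ perId: _*)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("flatten-without-udf")
      .getOrCreate()
    import spark.implicits._

    val df = Seq(
      InputRow("A", "B", Seq(
        C3Item("ID1", "V1", "V2"),
        C3Item("ID2", "V3", "V4"),
        C3Item("ID3", "V4", "V5")))
    ).toDF()

    flattenByIds(df, Seq("ID1", "ID2", "ID3")).printSchema()
    spark.stop()
  }
}
```

Because this is a plain select, no shuffle is needed and the number of other columns does not matter. What happens when an ID is absent from a row depends on the ANSI setting: with ANSI mode off the [0] lookup yields null, while with it on the lookup can fail, so check this against your Spark version.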
