
Transforming a Spark Dataframe Column into a Dataframe with just one line (ArrayType)

I have a dataframe that contains a column with complex objects in it:

+--------+
|col1    |
+--------+
|object1 |
|object2 | 
|object3 |    
+--------+

The schema of this object is pretty complex, something that looks like:

root:struct
    field1:string
    field2:decimal(38,18)
    object1:struct
        field3:string
        object2:struct
            field4:string
            field5:decimal(38,18)

What is the best way to group everything and transform it into an array?

e.g.:

+-----------------------------+
|col1                         |
+-----------------------------+
| [object1, object2, object3] |    
+-----------------------------+

I tried to generate an array from the column and then create a dataframe from it:

final case class A(b: Array[Any])

val c = df.select("col1").collect().map(_(0)).toArray

df.sparkSession.createDataset(Seq(A(b = c)))

However, Spark doesn't like my Array[Any] trick:

java.lang.ClassNotFoundException: scala.Any

Any ideas?

Spark uses encoders for datatypes, which is why Any doesn't work.
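A minimal sketch illustrating the encoder requirement (the Good/Bad case classes and the local session are assumptions for illustration): Spark can derive an encoder for a concrete element type like String, but there is no encoder for Any:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

// Works: Spark derives an encoder for Array[String]
case class Good(b: Array[String])
spark.createDataset(Seq(Good(Array("x", "y"))))

// Fails with java.lang.ClassNotFoundException: scala.Any,
// since no encoder can be derived for Array[Any]:
// case class Bad(b: Array[Any])
// spark.createDataset(Seq(Bad(Array[Any]("x", 1))))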

If the schema of the complex object is fixed, you can define a case class with that schema and do the following:

case class C(... object1: A, object2: B ...)

val df = ???

val mappedDF = df.as[C] // this maps each complex object to the case class
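For the schema shown in the question, a concrete set of case classes might look like this sketch (the field names come from the schema above; mapping decimal(38,18) to java.math.BigDecimal is the standard encoder mapping):

case class Object2(field4: String, field5: java.math.BigDecimal)
case class Object1(field3: String, object2: Object2)
case class C(field1: String, field2: java.math.BigDecimal, object1: Object1)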

Next, you can use a UDF to change each C object into a Seq(...) at the row level. It'll look something like this:

import org.apache.spark.sql.expressions.{UserDefinedFunction => UDF}
import org.apache.spark.sql.functions.{col, udf}

// collect the nested objects of each C into a single Seq
def convert: UDF =
  udf((complexObj: C) => Seq(complexObj.object1, complexObj.object2, complexObj.object3))

To use this UDF:

mappedDF.withColumn("resultColumn", convert(col("col1")))

Note: Since not much info was provided about the schema, I've used placeholder types like A and B. You will have to define all of these.
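One caveat worth hedging on: Spark passes struct columns into UDFs as Row objects, not as case class instances, so the UDF above may fail in practice (typically with a cast error on GenericRowWithSchema). An alternative sketch that stays in the typed Dataset API, assuming object1 and object2 share one case class type and spark.implicits._ is in scope for the Seq encoder:

// map over the typed Dataset instead of using a UDF
val result = mappedDF.map(c => Seq(c.object1, c.object2))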

"What is the best way to group everything and transform it into an array?"

There is not even a good way to do it. Please remember that Spark cannot distribute individual rows. The result will be:

  • Processed sequentially.
  • Possibly too large to be stored in memory.

Other than that, you can just use collect_list:

import org.apache.spark.sql.functions.{col, collect_list}

df.select(collect_list(col("col1")))
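A short usage sketch of what this produces (the alias and show call are added for illustration), matching the single-row layout from the question:

df.select(collect_list(col("col1")).alias("col1")).show(truncate = false)
// +-----------------------------+
// |col1                         |
// +-----------------------------+
// |[object1, object2, object3]  |
// +-----------------------------+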
