
Transforming a Spark Dataframe Column into a Dataframe with just one line (ArrayType)

I have a dataframe that contains a column with complex objects in it:

+--------+
|col1    |
+--------+
|object1 |
|object2 | 
|object3 |    
+--------+

The schema of this object is pretty complex, something that looks like:

root:struct
    field1:string
    field2:decimal(38,18)
    object1:struct
        field3:string
        object2:struct
            field4:string
            field5:decimal(38,18)

What is the best way to group everything and transform it into an array?

e.g.:

+-----------------------------+
|col1                         |
+-----------------------------+
| [object1, object2, object3] |    
+-----------------------------+

I tried to generate an array from the column and then create a dataframe from it:

final case class A(b: Array[Any])

val c = df.select("col1").collect().map(_(0)).toArray

df.sparkSession.createDataset(Seq(A(b = c)))

However, Spark doesn't like my Array[Any] trick:

java.lang.ClassNotFoundException: scala.Any

Any ideas?

Spark uses encoders for datatypes, which is why Any doesn't work.
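A minimal sketch illustrating the encoder requirement (the Good/Bad case classes and the local session are assumptions for illustration): Spark can derive an encoder for a concrete element type like String, but there is no encoder for Any:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

// Works: Spark derives an encoder for Array[String]
case class Good(b: Array[String])
spark.createDataset(Seq(Good(Array("x", "y"))))

// Fails with java.lang.ClassNotFoundException: scala.Any,
// since no encoder can be derived for Array[Any]:
// case class Bad(b: Array[Any])
// spark.createDataset(Seq(Bad(Array[Any]("x", 1))))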

If the schema of the complex object is fixed, you can define a case class with that schema and do the following:

case class C(... object1: A, object2: B ...)

val df = ???

val mappedDF = df.as[C] // this maps each complex object to the case class
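For the schema shown in the question, a concrete set of case classes might look like this sketch (the field names come from the schema above; mapping decimal(38,18) to java.math.BigDecimal is the standard encoder mapping):

case class Object2(field4: String, field5: java.math.BigDecimal)
case class Object1(field3: String, object2: Object2)
case class C(field1: String, field2: java.math.BigDecimal, object1: Object1)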

Next, you can use a UDF to change each C object into a Seq(...) at the row level. It'll look something like this:

import org.apache.spark.sql.expressions.{UserDefinedFunction => UDF}
import org.apache.spark.sql.functions.{col, udf}

// collect the nested objects of each C into a single Seq
def convert: UDF =
  udf((complexObj: C) => Seq(complexObj.object1, complexObj.object2, complexObj.object3))

To use this UDF:

mappedDF.withColumn("resultColumn", convert(col("col1")))

Note: Since not much info was provided about the schema, I've used placeholder types like A and B. You will have to define all of these.
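One caveat worth hedging on: Spark passes struct columns into UDFs as Row objects, not as case class instances, so the UDF above may fail in practice (typically with a cast error on GenericRowWithSchema). An alternative sketch that stays in the typed Dataset API, assuming object1 and object2 share one case class type and spark.implicits._ is in scope for the Seq encoder:

// map over the typed Dataset instead of using a UDF
val result = mappedDF.map(c => Seq(c.object1, c.object2))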

"What is the best way to group everything and transform it into an array?"

There is not even a good way to do it. Please remember that Spark cannot distribute individual rows. The result will be:

  • Processed sequentially.
  • Possibly too large to be stored in memory.

Other than that, you can just use collect_list:

import org.apache.spark.sql.functions.{col, collect_list}

df.select(collect_list(col("col1")))
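A short usage sketch of what this produces (the alias and show call are added for illustration), matching the single-row layout from the question:

df.select(collect_list(col("col1")).alias("col1")).show(truncate = false)
// +-----------------------------+
// |col1                         |
// +-----------------------------+
// |[object1, object2, object3]  |
// +-----------------------------+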
