Transforming a Spark Dataframe Column into a Dataframe with just one line (ArrayType)
I have a dataframe that contains a column with complex objects in it:
+--------+
|col1 |
+--------+
|object1 |
|object2 |
|object3 |
+--------+
The schema of this object is pretty complex, something that looks like:
root:struct
field1:string
field2:decimal(38,18)
object1:struct
field3:string
object2:struct
field4:string
field5:decimal(38,18)
What is the best way to group everything and transform it into an array?
e.g.:
+-----------------------------+
|col1 |
+-----------------------------+
| [object1, object2, object3] |
+-----------------------------+
I tried to generate an array from the column and then create a dataframe from it:
final case class A(b: Array[Any])
val c = df.select("col1").collect().map(_(0)).toArray
df.sparkSession.createDataset(Seq(A(b = c)))
However, Spark doesn't like my Array[Any] trick:
java.lang.ClassNotFoundException: scala.Any
Any ideas?
Spark uses encoders for datatypes; this is the reason Any doesn't work.
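As a minimal sketch of why: Spark can derive an Encoder for any case class (a Product of supported field types), but no encoder exists for Any, so createDataset fails on it. The Inner/Outer names below are hypothetical stand-ins for the real schema.

```scala
import org.apache.spark.sql.{Encoder, Encoders}

// Hypothetical stand-ins for the real nested schema.
case class Inner(field3: String)
case class Outer(field1: String, object1: Inner)

// Encoders are derived automatically for case classes:
val enc: Encoder[Outer] = Encoders.product[Outer]

// Encoders.product[Any] // does not compile: no encoder exists for Any
```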
If the schema of the complex object is fixed, you can define a case class with that schema and do the following:
case class C(... object1: A, object2: B ...)
val df = ???
val mappedDF = df.as[C] // this will map each complex object to case class
Next, you can use a UDF to change each C object to a Seq(...) at the row level. It'll look something like:
import org.apache.spark.sql.expressions.{UserDefinedFunction => UDF}
import org.apache.spark.sql.functions.{col, udf}

def convert: UDF =
  udf((complexObj: C) => Seq(complexObj.object1, complexObj.object2, complexObj.object3))
To use this UDF:
mappedDF.withColumn("resultColumn", convert(col("col1")))
Note: Since not much info was provided about the schema, I've used generic names like A and B. You will have to define all of these yourself.
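As a hedged end-to-end sketch with a hypothetical two-field schema: be aware that a Scala UDF receives a struct column as a Row, not as a case-class instance, so a variant of the same idea that sidesteps that detail is a typed map over the Dataset. The names Inner and C below are assumptions standing in for the real schema.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical stand-ins for the real complex schema.
case class Inner(field3: String)
case class C(field1: String, object1: Inner)

val spark = SparkSession.builder().master("local[1]").appName("sketch").getOrCreate()
import spark.implicits._

val df = Seq(C("x", Inner("a")), C("y", Inner("b"))).toDF()

val mappedDS = df.as[C]                        // typed view of each complex row
val seqDS = mappedDS.map(c => Seq(c.object1))  // wrap the nested object(s) per row
```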
What is the best way to group everything and transform it into an array?
There is not even a good way to do it. Please remember that Spark cannot distribute individual rows: the result will be a single row holding the entire array, so all the data ends up on one machine. With that caveat, you can just use collect_list:
import org.apache.spark.sql.functions.{col, collect_list}

df.select(collect_list(col("col1")))
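A self-contained sketch of the collect_list approach (the column names here are hypothetical): the output is a single row containing the whole array, and collect_list makes no ordering guarantee.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, collect_list, struct}

val spark = SparkSession.builder().master("local[1]").appName("demo").getOrCreate()
import spark.implicits._

// Build one struct per row to play the role of the complex objects.
val df = Seq(("a", 1.0), ("b", 2.0), ("c", 3.0))
  .toDF("field1", "field2")
  .select(struct(col("field1"), col("field2")).as("col1"))

// Collapse every struct into one array in a single row:
val result = df.select(collect_list(col("col1")).as("col1"))
result.show(false)
```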