
Pass an ArrayType column to a UDF in Spark Scala

I have a column in my Spark dataframe in Scala that was generated by aggregating multiple columns using:

 agg(collect_list(struct(col("abc"), col("aaa"))).as("def"))

I want to pass this column to a UDF for further processing, to work on one of the indexes in this aggregated column.

When I pass the argument to my UDF as:

.withColumn("def", removeUnstableActivations(col("xyz"), col("def")))

The UDF, with the column typed as Seq[Row]:

 val removeUnstableActivations: UserDefinedFunction = udf((xyz: java.util.Date, def: Seq[Row])

I get the error:

Exception encountered when invoking run on a nested suite - Schema for type org.apache.spark.sql.Row is not supported

How should I pass this column, and what should the datatype of the column be in the UDF?

Indeed, the schema for type Row is not supported, but you can return a case class instead. Spark will treat the returned case class as a StructType. E.g.:

import org.apache.spark.sql.functions._
import spark.implicits._
import org.apache.spark.sql.expressions.UserDefinedFunction
import org.apache.spark.sql.Row

val df = Seq(
  (1, "a"),
  (2, "b"),
  (3, "c")
).toDF("number", "word")

val aggDf = df.agg(
  collect_list(struct(col("number"), col("word"))) as "aggColumn"
)

aggDf.printSchema()
// root
//  |-- aggColumn: array (nullable = true)
//  |    |-- element: struct (containsNull = true)
//  |    |    |-- number: integer (nullable = false)
//  |    |    |-- word: string (nullable = true)

case class ReturnSchema(word: String, number: Int)

// The UDF receives the array of structs as Seq[Row] and returns a Seq of the
// case class, which Spark maps back to an array of structs (StructType).
val myUdf: UserDefinedFunction =
  udf((collection: Seq[Row]) => {
    collection.map(r => {
      val word      = r.getAs[String]("word")
      val newNumber = r.getAs[Int]("number") * 100

      ReturnSchema(word, newNumber)
    })
  })
  
val finalDf = aggDf.select(myUdf(col("aggColumn")).as("udfTransformedColumn"))

finalDf.printSchema
// root
//  |-- udfTransformedColumn: array (nullable = true)
//  |    |-- element: struct (containsNull = true)
//  |    |    |-- word: string (nullable = true)
//  |    |    |-- number: integer (nullable = false)

finalDf.show(false)
// +------------------------------+
// |udfTransformedColumn          |
// +------------------------------+
// |[[a, 100], [b, 200], [c, 300]]|
// +------------------------------+
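
If the goal is only to read one element or one field of the aggregated array, the UDF can often be avoided entirely with built-in functions. A minimal sketch, assuming Spark 2.4+ and the same aggDf as above (the output column names transformedColumn and firstNumber are just illustrative):

import org.apache.spark.sql.functions.expr

// Rewrite every struct in the array without a UDF: multiply `number` by 100.
val noUdfDf = aggDf.select(
  expr("transform(aggColumn, x -> named_struct('word', x.word, 'number', x.number * 100))")
    .as("transformedColumn")
)

// Pick a single element of the array (element_at is 1-based) and read one of its fields.
val firstNumberDf = aggDf.select(
  expr("element_at(aggColumn, 1)").getField("number").as("firstNumber")
)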
