Spark SQL - Scala - Aggregate function as a parameter to create a DataFrame column

Question

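I'm trying to create a function that takes as its main parameters:

a DataFrame
another function (an aggregate: count, countDistinct, max, etc.)

My goal is to return a DataFrame with a new column based on the provided function.

However, I'm having trouble with the types. I've been searching here, and most of what I found points to UDFs and says I need to create one in order to apply it in "withColumn".

When I run something like this: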

val DF1 = Seq(
  ("asd", "1", "search", "otpx"),
  ("asd", "1", "xpto", "otpx"),
  ("asd", "2", "xpto", "otpx"),
  ("asd", "3", "xpto", "otpx"),
  ("asd", "3", "search", "otpx"),
  ("asd", "4", "search", "otpx"),

  ("zxc", "1", "search", "otpx"),
  ("zxc", "1", "search", "otpx"),
  ("zxc", "1", "search", "otpx"),
  ("zxc", "1", "search", "otpx"),
  ("zxc", "2", "xpto", "otpx"),
  ("zxc", "3", "xpto", "otpx"),
  ("zxc", "3", "xpto", "otpx"),
  ("zxc", "3", "xpto", "otpx"),

  ("qwe", "1", "xpto", "otpx"),
  ("qwe", "2", "xpto", "otpx"),
  ("qwe", "3", "xpto", "otpx"),
  ("qwe", "4", "xpto", "otpx"),
  ("qwe", "5", "xpto", "otpx")

).toDF("cid", "cts", "type", "subtype")

DF1.show(100)

val canList = List("cid", "cts")

def test[T](df: DataFrame, fn: Column => T, newColName: String, colToFn: String, partitionByColumns: List[String]): DataFrame = {

  val window = Window.partitionBy(partitionByColumns.head, partitionByColumns.tail:_*)

  val fun: (Column => T) = (arg: Column) => fn(arg) // or right away udfFun = udf(fn)

  val udfFun = udf(fun)

  val ret = df.withColumn(newColName, udfFun(df(colToFn)).over(window))

  ret
}

val DF2 = test(DF1, countDistinct, "count_type", "type", canList)

DF2.orderBy(canList.head, canList.tail:_*).show(100)

I get an error like this:

No TypeTag available for T

val udfFun = udf(fun)

What am I missing here?

Thanks in advance, cheers!

Answer 1

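First, note that countDistinct is not supported as a window function. If you want to define a function that accepts other aggregate functions to apply over a window (such as count), you can define fn as a function that takes a Column and returns a Column. A UDF is not appropriate here, because you are calling Spark SQL functions rather than custom Scala functions. That is also where the TypeTag error comes from: udf needs a TypeTag for the return type, and none is available for the unconstrained type parameter T.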

def test(df: DataFrame,
         fn: Column => Column,
         newColName: String,
         colToFn: String,
         partitionByColumns: List[String]
): DataFrame = {
  val window = Window.partitionBy(partitionByColumns.head, partitionByColumns.tail:_*)
  val ret = df.withColumn(newColName, fn(col(colToFn)).over(window))
  ret
}

// calling the function
test(DF1, count, "count_type", "type", canList)
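
For completeness, here is a minimal self-contained sketch of the same idea with the imports it needs. The object name WindowAggExample, the helper name withWindowAgg, the local SparkSession settings, and the shortened sample data are assumptions made for illustration and are not part of the original answer. The aggregates are passed as explicit lambdas to sidestep the overloads of count and max. The last step shows one commonly used substitute for a distinct count over a window, size(collect_set(...)), which has to be written directly because over must be applied to the aggregate (collect_set), not to size.

```scala
import org.apache.spark.sql.{Column, DataFrame, SparkSession}
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, collect_set, count, max, size}

object WindowAggExample {

  // Hypothetical helper: applies any Column => Column aggregate over a window
  // partitioned by the given columns and stores the result in newColName.
  def withWindowAgg(df: DataFrame,
                    fn: Column => Column,
                    newColName: String,
                    colToFn: String,
                    partitionByColumns: List[String]): DataFrame = {
    val window = Window.partitionBy(partitionByColumns.map(col): _*)
    df.withColumn(newColName, fn(col(colToFn)).over(window))
  }

  def main(args: Array[String]): Unit = {
    // Assumption: a local session, only for trying the sketch out.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("window-agg-example")
      .getOrCreate()
    import spark.implicits._

    // Shortened version of the sample data from the question.
    val df = Seq(
      ("asd", "1", "search"),
      ("asd", "1", "xpto"),
      ("asd", "2", "xpto"),
      ("zxc", "1", "search"),
      ("zxc", "1", "search")
    ).toDF("cid", "cts", "type")

    val keys = List("cid", "cts")

    // Built-in aggregates passed as plain Column => Column functions.
    val counted = withWindowAgg(df, c => count(c), "count_type", "type", keys)
    val maxed = withWindowAgg(counted, c => max(c), "max_type", "type", keys)

    // Distinct count per partition: collect the distinct values over the
    // window, then take the size of the resulting array.
    val window = Window.partitionBy(keys.map(col): _*)
    val result = maxed.withColumn(
      "distinct_type",
      size(collect_set(col("type")).over(window))
    )

    result.orderBy(keys.map(col): _*).show(100)
    spark.stop()
  }
}
```

If an approximate result is acceptable, approx_count_distinct from org.apache.spark.sql.functions can be used over a window and would fit the Column => Column parameter directly.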
