What data type should I use for tuple in Spark Dataframe udf?

Input:

val df = Seq((10, (35, 25))).toDF("id", "scorePair")
df.show
+---+---------+
| id|scorePair|
+---+---------+
| 10| [35, 25]|
+---+---------+

Expected output:

+---+-----------+
| id|totalScore |
+---+-----------+
| 10|         60|
+---+-----------+

I wanted to do something like this, but it does not accept the Row type:

// error: does not compile; a tuple pattern (a: Int, b: Int) cannot match a Row
val add = udf((row: Row) => {row match {case (a: Int, b: Int) => a + b}})
df.withColumn("totalScore", add(col("scorePair")))

Why is the Row type not correct, given that "Dataframe is an alias for Dataset[Row]"?

What type should I use? How can I achieve it?


  • I emphasize the type Row, because at least I managed to use Row in the following way (which treats each cell of a column as a Row) to achieve this:
val add = udf((rows: Seq[Row]) => {rows.map {case Row(a: Int, b: Int) => a + b}})
df.groupBy("id")
  .agg(collect_list("scorePair") as "pairSeq")
  .withColumn("totalScore1", add(col("pairSeq")))
  .select(col("id"), explode(col("totalScore1")) as "totalScore")
  .show
+---+----------+
| id|totalScore|
+---+----------+
| 10|        60|
+---+----------+

But that's really not clean!

You can use any of row.getAs[Int](0) , row.get(0).asInstanceOf[Int] , or row.getInt(0) to get a value from the row:

val df = Seq(
  (10, (35, 25))
).toDF("id", "scorePair")


// UDF approach: the struct column is passed to the function as a Row
val add = udf((row: Row) => row.getInt(0) + row.getInt(1))

df.withColumn("totalScore", add($"scorePair")).show(false)

// UDF-free alternative: reference the struct fields directly
df.select($"id", $"scorePair._1" + $"scorePair._2" as "totalScore").show(false)

Output:

+---+----------+
|id |totalScore|
+---+----------+
|10 |60        |
+---+----------+
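
As for the error in the question itself: the match failed because a tuple pattern cannot be applied to a Row; Row has its own extractor. Here is a minimal sketch of the same udf written with the Row pattern and with getAs (both are standard Row APIs, equivalent to the getInt version above):

import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.udf

// Row extractor pattern instead of a tuple pattern
val addPattern = udf((row: Row) => row match { case Row(a: Int, b: Int) => a + b })

// getAs variant of the same udf
val addGetAs = udf((row: Row) => row.getAs[Int](0) + row.getAs[Int](1))

df.withColumn("totalScore", addPattern($"scorePair")).show(false)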

The aggregate function is the easiest way to sum all the numbers in an ArrayType column. This post has a full example. Here's the snippet:

// note: aggregate expects an ArrayType column, so the struct column scorePair
// would first have to be converted to an array (see the sketch below)
val resDF = df.withColumn(
  "totalScore",
  aggregate(
    col("scorePair"),
    lit(0),
    (col1: Column, col2: Column) => col1 + col2
  )
)

You want to avoid UDFs whenever possible. This solution only works for Spark 3+.
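
Since scorePair in the example is a struct rather than an array, here is a hedged sketch of how the aggregate approach could be applied to it, assuming Spark 3+ (the scores column name is illustrative):

import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.{aggregate, array, col, lit}

// build an ArrayType column from the struct fields, then fold it with aggregate
val total = df
  .withColumn("scores", array(col("scorePair._1"), col("scorePair._2")))
  .withColumn("totalScore", aggregate(col("scores"), lit(0), (acc: Column, x: Column) => acc + x))
  .select("id", "totalScore")

total.show(false)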
