
How to use collect_set and collect_list functions in windowed aggregation in Spark 1.6?

In Spark 1.6.0 / Scala, is it possible to get collect_list("colC") or collect_set("colC").over(Window.partitionBy("colA").orderBy("colB"))?

Given that you have a dataframe such as:

+----+----+----+
|colA|colB|colC|
+----+----+----+
|1   |1   |23  |
|1   |2   |63  |
|1   |3   |31  |
|2   |1   |32  |
|2   |2   |56  |
+----+----+----+
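
For reference, here is a minimal sketch of how this example dataframe could be built in Spark 1.6, assuming an available SQLContext named sqlContext (the variable names are illustrative):

import sqlContext.implicits._  // enables toDF on local collections

// Build the example dataframe from a local sequence of tuples
val df = Seq(
  (1, 1, 23),
  (1, 2, 63),
  (1, 3, 31),
  (2, 1, 32),
  (2, 2, 56)
).toDF("colA", "colB", "colC")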

You can use window functions by doing the following:

import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions._
df.withColumn("colD", collect_list("colC").over(Window.partitionBy("colA").orderBy("colB"))).show(false)

Result:

+----+----+----+------------+
|colA|colB|colC|colD        |
+----+----+----+------------+
|1   |1   |23  |[23]        |
|1   |2   |63  |[23, 63]    |
|1   |3   |31  |[23, 63, 31]|
|2   |1   |32  |[32]        |
|2   |2   |56  |[32, 56]    |
+----+----+----+------------+

The result for collect_set is similar, but the order of elements in the final set will not match the input order as it does with collect_list. Note that because the window has an orderBy, its default frame runs from the start of the partition to the current row, which is why the collected values accumulate row by row.

df.withColumn("colD", collect_set("colC").over(Window.partitionBy("colA").orderBy("colB"))).show(false)
+----+----+----+------------+
|colA|colB|colC|colD        |
+----+----+----+------------+
|1   |1   |23  |[23]        |
|1   |2   |63  |[63, 23]    |
|1   |3   |31  |[63, 31, 23]|
|2   |1   |32  |[32]        |
|2   |2   |56  |[56, 32]    |
+----+----+----+------------+
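
If you need the distinct values in a deterministic order, one option (a sketch, not part of the original answer) is to wrap the windowed set in sort_array, which has been available since Spark 1.5:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{collect_set, sort_array}

// sort_array puts the distinct values into ascending order,
// making the output deterministic across runs
df.withColumn(
  "colD",
  sort_array(collect_set("colC").over(Window.partitionBy("colA").orderBy("colB")))
).show(false)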

If you remove the orderBy, as below,

df.withColumn("colD", collect_list("colC").over(Window.partitionBy("colA"))).show(false)

the result would be:

+----+----+----+------------+
|colA|colB|colC|colD        |
+----+----+----+------------+
|1   |1   |23  |[23, 63, 31]|
|1   |2   |63  |[23, 63, 31]|
|1   |3   |31  |[23, 63, 31]|
|2   |1   |32  |[32, 56]    |
|2   |2   |56  |[32, 56]    |
+----+----+----+------------+
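
If you want every row to carry the full list while still controlling its order (combining the two behaviours above), you can keep the orderBy and widen the frame explicitly. A sketch, assuming Spark 1.6, where an unbounded frame is expressed with Long.MinValue / Long.MaxValue:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.collect_list

// Order rows by colB but collect over the whole partition,
// so each row receives the same complete, ordered list
val fullWindow = Window
  .partitionBy("colA")
  .orderBy("colB")
  .rowsBetween(Long.MinValue, Long.MaxValue)

df.withColumn("colD", collect_list("colC").over(fullWindow)).show(false)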

I hope the answer is helpful.

The existing answer is valid; just adding here a different style of writing window functions:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{collect_list, collect_set}
import sqlContext.implicits._  // enables the $"..." column syntax

// colA2/colB2 assume a dataframe with extra partition/order columns;
// descending sort requires a Column, so use $"colB2".desc, not "colB2".desc
val wind_user = Window.partitionBy("colA", "colA2").orderBy($"colB", $"colB2".desc)

df.withColumn("colD_distinct", collect_set($"colC") over wind_user)
  .withColumn("colD_historical", collect_list($"colC") over wind_user)
  .show(false)
