
How to use collect_set and collect_list functions in windowed aggregation in Spark 1.6?

In Spark 1.6.0 / Scala, is it possible to get collect_list("colC") or collect_set("colC").over(Window.partitionBy("colA").orderBy("colB"))?

Given that you have a dataframe such as:

+----+----+----+
|colA|colB|colC|
+----+----+----+
|1   |1   |23  |
|1   |2   |63  |
|1   |3   |31  |
|2   |1   |32  |
|2   |2   |56  |
+----+----+----+
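
For reference, here is a minimal sketch of how this example dataframe could be built in Spark 1.6, assuming an available SQLContext named sqlContext (the variable names are illustrative):

import sqlContext.implicits._  // enables toDF on local collections

// Build the example dataframe from a local sequence of tuples
val df = Seq(
  (1, 1, 23),
  (1, 2, 63),
  (1, 3, 31),
  (2, 1, 32),
  (2, 2, 56)
).toDF("colA", "colB", "colC")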

You can use window functions by doing the following:

import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions._
df.withColumn("colD", collect_list("colC").over(Window.partitionBy("colA").orderBy("colB"))).show(false)

Result:

+----+----+----+------------+
|colA|colB|colC|colD        |
+----+----+----+------------+
|1   |1   |23  |[23]        |
|1   |2   |63  |[23, 63]    |
|1   |3   |31  |[23, 63, 31]|
|2   |1   |32  |[32]        |
|2   |2   |56  |[32, 56]    |
+----+----+----+------------+

The result for collect_set is similar, but the order of elements in the final set will not match the input order as it does with collect_list. Note that because the window has an orderBy, its default frame runs from the start of the partition to the current row, which is why the collected values accumulate row by row.

df.withColumn("colD", collect_set("colC").over(Window.partitionBy("colA").orderBy("colB"))).show(false)
+----+----+----+------------+
|colA|colB|colC|colD        |
+----+----+----+------------+
|1   |1   |23  |[23]        |
|1   |2   |63  |[63, 23]    |
|1   |3   |31  |[63, 31, 23]|
|2   |1   |32  |[32]        |
|2   |2   |56  |[56, 32]    |
+----+----+----+------------+
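
If you need the distinct values in a deterministic order, one option (a sketch, not part of the original answer) is to wrap the windowed set in sort_array, which has been available since Spark 1.5:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{collect_set, sort_array}

// sort_array puts the distinct values into ascending order,
// making the output deterministic across runs
df.withColumn(
  "colD",
  sort_array(collect_set("colC").over(Window.partitionBy("colA").orderBy("colB")))
).show(false)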

If you remove the orderBy, as below,

df.withColumn("colD", collect_list("colC").over(Window.partitionBy("colA"))).show(false)

the result would be:

+----+----+----+------------+
|colA|colB|colC|colD        |
+----+----+----+------------+
|1   |1   |23  |[23, 63, 31]|
|1   |2   |63  |[23, 63, 31]|
|1   |3   |31  |[23, 63, 31]|
|2   |1   |32  |[32, 56]    |
|2   |2   |56  |[32, 56]    |
+----+----+----+------------+
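
If you want every row to carry the full list while still controlling its order (combining the two behaviours above), you can keep the orderBy and widen the frame explicitly. A sketch, assuming Spark 1.6, where an unbounded frame is expressed with Long.MinValue / Long.MaxValue:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.collect_list

// Order rows by colB but collect over the whole partition,
// so each row receives the same complete, ordered list
val fullWindow = Window
  .partitionBy("colA")
  .orderBy("colB")
  .rowsBetween(Long.MinValue, Long.MaxValue)

df.withColumn("colD", collect_list("colC").over(fullWindow)).show(false)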

I hope the answer is helpful.

The existing answer is valid; just adding here a different style of writing window functions:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{collect_list, collect_set}
import sqlContext.implicits._  // enables the $"..." column syntax

// colA2/colB2 assume a dataframe with extra partition/order columns;
// descending sort requires a Column, so use $"colB2".desc, not "colB2".desc
val wind_user = Window.partitionBy("colA", "colA2").orderBy($"colB", $"colB2".desc)

df.withColumn("colD_distinct", collect_set($"colC") over wind_user)
  .withColumn("colD_historical", collect_list($"colC") over wind_user)
  .show(false)
