How to use collect_set and collect_list functions in windowed aggregation in Spark 1.6?
In Spark 1.6.0 / Scala, is there a way to get collect_list("colC") or collect_set("colC").over(Window.partitionBy("colA").orderBy("colB"))?
Given that you have a dataframe such as:
+----+----+----+
|colA|colB|colC|
+----+----+----+
|1 |1 |23 |
|1 |2 |63 |
|1 |3 |31 |
|2 |1 |32 |
|2 |2 |56 |
+----+----+----+
You can use Window functions as follows:
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions._
df.withColumn("colD", collect_list("colC").over(Window.partitionBy("colA").orderBy("colB"))).show(false)
Result:
+----+----+----+------------+
|colA|colB|colC|colD |
+----+----+----+------------+
|1 |1 |23 |[23] |
|1 |2 |63 |[23, 63] |
|1 |3 |31 |[23, 63, 31]|
|2 |1 |32 |[32] |
|2 |2 |56 |[32, 56] |
+----+----+----+------------+
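The growing lists come from Spark's default window frame: when an orderBy is present, the frame runs from the start of the partition up to the current row, so each row aggregates only the values at or before its own position. As a rough plain-Scala sketch of that cumulative semantics (no Spark required; the values are hard-coded from the colA = 1 partition of the example):

```scala
// Cumulative collect_list over one partition (colA = 1), ordered by colB:
// each row's list holds every colC value up to and including its own.
val colC = Seq(23, 63, 31) // colC values for colA = 1, already sorted by colB

// scanLeft builds the running lists; drop the empty seed at the front
val running = colC.scanLeft(List.empty[Int])((acc, v) => acc :+ v).tail

running.foreach(println)
// List(23)
// List(23, 63)
// List(23, 63, 31)
```

This mirrors the colD column above: row (1,1) sees [23], row (1,2) sees [23, 63], and so on.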
The result is similar for collect_set as well, but the order of elements in the final set will not follow colB the way it does with collect_list:
df.withColumn("colD", collect_set("colC").over(Window.partitionBy("colA").orderBy("colB"))).show(false)
+----+----+----+------------+
|colA|colB|colC|colD |
+----+----+----+------------+
|1 |1 |23 |[23] |
|1 |2 |63 |[63, 23] |
|1 |3 |31 |[63, 31, 23]|
|2 |1 |32 |[32] |
|2 |2 |56 |[56, 32] |
+----+----+----+------------+
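The loss of ordering is inherent to sets: collect_set deduplicates the values, and a set carries no guarantee about iteration order. A loose plain-Scala analogy (standard library only, nothing Spark-specific):

```scala
// A set keeps the same elements as the insertion sequence, but iterating it
// is not guaranteed to reproduce the insertion order.
val inserted = Seq(23, 63, 31)
val asSet = inserted.toSet

// Same elements either way...
assert(asSet == Set(23, 63, 31))

// ...but the iteration order may differ from the order of insertion,
// just as collect_set's output order may differ from the orderBy column.
println(asSet.toList)
```

So if you need the values in colB order, use collect_list; use collect_set only when you need deduplication and do not care about order.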
If you remove the orderBy, as below:
df.withColumn("colD", collect_list("colC").over(Window.partitionBy("colA"))).show(false)
the result would be:
+----+----+----+------------+
|colA|colB|colC|colD |
+----+----+----+------------+
|1 |1 |23 |[23, 63, 31]|
|1 |2 |63 |[23, 63, 31]|
|1 |3 |31 |[23, 63, 31]|
|2 |1 |32 |[32, 56] |
|2 |2 |56 |[32, 56] |
+----+----+----+------------+
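Without an orderBy, the window frame spans the entire partition, so every row in a partition receives the same complete list. In effect this is a group-by followed by a join back onto the original rows, which can be sketched in plain Scala (no Spark; rows hard-coded from the example):

```scala
// (colA, colB, colC) rows from the example dataframe
val rows = Seq((1, 1, 23), (1, 2, 63), (1, 3, 31), (2, 1, 32), (2, 2, 56))

// Collect all colC values per partition key colA (groupBy on a Seq
// preserves the original element order within each group)
val perPartition = rows.groupBy(_._1).map { case (k, vs) => k -> vs.map(_._3) }

// Attach the full partition list to every row, like the colD column above
val withColD = rows.map { case (a, b, c) => (a, b, c, perPartition(a)) }
withColD.foreach(println)
```

Every row with colA = 1 ends up carrying the same List(23, 63, 31), matching the output above.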
I hope the answer is helpful.
The existing answer is valid; I'm just adding a different style of writing window functions here:
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.Window
import sqlContext.implicits._ // Spark 1.6: enables the $"..." column syntax

// Note: this style assumes the dataframe also has colA2 and colB2 columns.
val wind_user = Window.partitionBy("colA", "colA2").orderBy($"colB", $"colB2".desc)

df.withColumn("colD_distinct", collect_set($"colC") over wind_user)
  .withColumn("colD_historical", collect_list($"colC") over wind_user)
  .show(false)