Spark collect_set from a column using window function approach

I have a sample dataset with salaries. I want to distribute the salaries into 3 buckets, find the lowest salary in each bucket, collect those values into an array, and attach that array to every row of the original dataset. I am trying to do this with window functions, but the array seems to be built up progressively, row by row, instead of being the same on every row.

Here is the code I have written:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{collect_set, min, ntile}

// sparkSession is an existing SparkSession
val spark = sparkSession
import spark.implicits._

val simpleData = Seq(("James", "Sales", 3000),
  ("Michael", "Sales", 3100),
  ("Robert", "Sales", 3200),
  ("Maria", "Finance", 3300),
  ("James", "Sales", 3400),
  ("Scott", "Finance", 3500),
  ("Jen", "Finance", 3600),
  ("Jeff", "Marketing", 3700),
  ("Kumar", "Marketing", 3800),
  ("Saif", "Sales", 3900)
)
val df = simpleData.toDF("employee_name", "department", "salary")

// Split the rows into 3 buckets ordered by salary
val windowSpec = Window.orderBy("salary")
val ntileFrame = df.withColumn("ntile", ntile(3).over(windowSpec))

// Lowest salary within each bucket
val lowWindowSpec = Window.partitionBy("ntile")
val ntileMinDf = ntileFrame.withColumn("lower_bound", min("salary").over(lowWindowSpec))

// Collect the lower bounds into an array (this is the step that behaves unexpectedly)
val rangeDf = ntileMinDf.withColumn("range", collect_set("lower_bound").over(windowSpec))
rangeDf.show()

I am getting a dataset like this:

+-------------+----------+------+-----+-----------+------------------+
|employee_name|department|salary|ntile|lower_bound|             range|
+-------------+----------+------+-----+-----------+------------------+
|        James|     Sales|  3000|    1|       3000|            [3000]|
|      Michael|     Sales|  3100|    1|       3000|            [3000]|
|       Robert|     Sales|  3200|    1|       3000|            [3000]|
|        Maria|   Finance|  3300|    1|       3000|            [3000]|
|        James|     Sales|  3400|    2|       3400|      [3000, 3400]|
|        Scott|   Finance|  3500|    2|       3400|      [3000, 3400]|
|          Jen|   Finance|  3600|    2|       3400|      [3000, 3400]|
|         Jeff| Marketing|  3700|    3|       3700|[3000, 3700, 3400]|
|        Kumar| Marketing|  3800|    3|       3700|[3000, 3700, 3400]|
|         Saif|     Sales|  3900|    3|       3700|[3000, 3700, 3400]|
+-------------+----------+------+-----+-----------+------------------+

I am expecting the dataset to look like this:

+-------------+----------+------+-----+-----------+------------------+
|employee_name|department|salary|ntile|lower_bound|             range|
+-------------+----------+------+-----+-----------+------------------+
|        James|     Sales|  3000|    1|       3000|[3000, 3700, 3400]|
|      Michael|     Sales|  3100|    1|       3000|[3000, 3700, 3400]|
|       Robert|     Sales|  3200|    1|       3000|[3000, 3700, 3400]|
|        Maria|   Finance|  3300|    1|       3000|[3000, 3700, 3400]|
|        James|     Sales|  3400|    2|       3400|[3000, 3700, 3400]|
|        Scott|   Finance|  3500|    2|       3400|[3000, 3700, 3400]|
|          Jen|   Finance|  3600|    2|       3400|[3000, 3700, 3400]|
|         Jeff| Marketing|  3700|    3|       3700|[3000, 3700, 3400]|
|        Kumar| Marketing|  3800|    3|       3700|[3000, 3700, 3400]|
|         Saif|     Sales|  3900|    3|       3700|[3000, 3700, 3400]|
+-------------+----------+------+-----+-----------+------------------+
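
The difference between the two tables can be pictured as a difference in which rows each row is allowed to "see" when the set is collected. The following is only a plain-Scala model of that idea, not Spark code: the object FrameDemo and the functions runningSets and fullSets are hypothetical names made up for this illustration, and the model assumes the rows are already sorted by salary.

```scala
// Plain-Scala model of two window frames; no Spark required.
// All names here (FrameDemo, runningSets, fullSets) are illustrative only.
object FrameDemo {
  // (salary, lower_bound) pairs taken from the tables above, sorted by salary.
  val rows: Seq[(Int, Int)] = Seq(
    3000 -> 3000, 3100 -> 3000, 3200 -> 3000, 3300 -> 3000,
    3400 -> 3400, 3500 -> 3400, 3600 -> 3400,
    3700 -> 3700, 3800 -> 3700, 3900 -> 3700
  )

  // "Start of data up to the current row": each row collects only the
  // lower bounds of rows whose salary is <= its own salary.
  def runningSets(data: Seq[(Int, Int)]): Seq[Set[Int]] =
    data.map { case (salary, _) =>
      data.collect { case (s, lb) if s <= salary => lb }.toSet
    }

  // "Start of data up to the end of data": every row collects from all rows.
  def fullSets(data: Seq[(Int, Int)]): Seq[Set[Int]] = {
    val all = data.map(_._2).toSet
    data.map(_ => all)
  }

  def main(args: Array[String]): Unit = {
    println("running frame:")
    runningSets(rows).foreach(println)
    println("full frame:")
    fullSets(rows).foreach(println)
  }
}
```

In this model, runningSets grows from one element to three as the salary increases, which matches the shape of the "range" column actually obtained, while fullSets gives the same three-element set on every row, which matches the expected table.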

When a window specification has an orderBy but no explicit frame, Spark's default frame runs from the start of the partition up to the current row, which is why the array grows progressively. To make the window take all rows into account, and not only the rows up to the current one, you can use the rowsBetween method with Window.unboundedPreceding and Window.unboundedFollowing as arguments. Your last line thus becomes:

// Frame spans the whole dataset, so every row sees every lower_bound
val rangeDf = ntileMinDf.withColumn(
  "range",
  collect_set("lower_bound")
    .over(Window.rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing))
)

and you get the following rangeDf dataframe:

+-------------+----------+------+-----+-----------+------------------+
|employee_name|department|salary|ntile|lower_bound|             range|
+-------------+----------+------+-----+-----------+------------------+
|        James|     Sales|  3000|    1|       3000|[3000, 3700, 3400]|
|      Michael|     Sales|  3100|    1|       3000|[3000, 3700, 3400]|
|       Robert|     Sales|  3200|    1|       3000|[3000, 3700, 3400]|
|        Maria|   Finance|  3300|    1|       3000|[3000, 3700, 3400]|
|        James|     Sales|  3400|    2|       3400|[3000, 3700, 3400]|
|        Scott|   Finance|  3500|    2|       3400|[3000, 3700, 3400]|
|          Jen|   Finance|  3600|    2|       3400|[3000, 3700, 3400]|
|         Jeff| Marketing|  3700|    3|       3700|[3000, 3700, 3400]|
|        Kumar| Marketing|  3800|    3|       3700|[3000, 3700, 3400]|
|         Saif|     Sales|  3900|    3|       3700|[3000, 3700, 3400]|
+-------------+----------+------+-----+-----------+------------------+
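
Note that collect_set does not guarantee the order of the elements, which is why the array shows up as [3000, 3700, 3400]. If a stable order matters, one possible variation is to compute the set once with an aggregation, sort it with sort_array, and attach it to every row with crossJoin. The sketch below is an alternative to the window approach, not the method from the answer above; the object name RangeByCrossJoin, the helpers sample and withRange, and the local SparkSession settings are assumptions made for illustration.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{collect_set, min, ntile, sort_array}

// Illustrative names only: RangeByCrossJoin, sample, withRange.
object RangeByCrossJoin {
  // Same sample data as in the question.
  def sample(spark: SparkSession): DataFrame = {
    import spark.implicits._
    Seq(
      ("James", "Sales", 3000), ("Michael", "Sales", 3100),
      ("Robert", "Sales", 3200), ("Maria", "Finance", 3300),
      ("James", "Sales", 3400), ("Scott", "Finance", 3500),
      ("Jen", "Finance", 3600), ("Jeff", "Marketing", 3700),
      ("Kumar", "Marketing", 3800), ("Saif", "Sales", 3900)
    ).toDF("employee_name", "department", "salary")
  }

  def withRange(df: DataFrame): DataFrame = {
    // Bucket by salary, then take the lowest salary in each bucket.
    val ntileMinDf = df
      .withColumn("ntile", ntile(3).over(Window.orderBy("salary")))
      .withColumn("lower_bound", min("salary").over(Window.partitionBy("ntile")))

    // One-row DataFrame holding the sorted set of lower bounds.
    val bounds = ntileMinDf.agg(sort_array(collect_set("lower_bound")).as("range"))

    // Attach that single array to every row.
    ntileMinDf.crossJoin(bounds)
  }

  def main(args: Array[String]): Unit = {
    // Local session for demonstration purposes only.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("range-by-cross-join")
      .getOrCreate()

    withRange(sample(spark)).orderBy("salary").show()
    spark.stop()
  }
}
```

With the sample data, every row should carry the array in ascending order, [3000, 3400, 3700], instead of the arbitrary order produced by collect_set alone. The same sort_array call can also be wrapped around the windowed collect_set if you prefer to keep the rowsBetween version.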
