[英]Spark collect_set from a column using window function approach
I have a sample dataset with salaries.我有一个带薪水的样本数据集。 I want to distribute that salary into 3 buckets and then find the lower of the salary in each bucket and then convert that into an array and attach it to the original set.
我想将该薪水分配到 3 个桶中,然后在每个桶中找到较低的薪水,然后将其转换为数组并将其附加到原始集合中。 I am trying to use window function to do that.
我正在尝试使用 window function 来做到这一点。 And it seems to do it in a progressive fashion.
它似乎以一种渐进的方式做到这一点。
Here is the code that I have written这是我写的代码
val spark = sparkSession
import spark.implicits._
val simpleData = Seq(("James", "Sales", 3000),
("Michael", "Sales", 3100),
("Robert", "Sales", 3200),
("Maria", "Finance", 3300),
("James", "Sales", 3400),
("Scott", "Finance", 3500),
("Jen", "Finance", 3600),
("Jeff", "Marketing", 3700),
("Kumar", "Marketing", 3800),
("Saif", "Sales", 3900)
)
val df = simpleData.toDF("employee_name", "department", "salary")
val windowSpec = Window.orderBy("salary")
val ntileFrame = df.withColumn("ntile", ntile(3).over(windowSpec))
val lowWindowSpec = Window.partitionBy("ntile")
val ntileMinDf = ntileFrame.withColumn("lower_bound", min("salary").over(lowWindowSpec))
var rangeDf = ntileMinDf.withColumn("range", collect_set("lower_bound").over(windowSpec))
rangeDf.show()
I am getting the dataset like this我得到这样的数据集
+-------------+----------+------+-----+-----------+------------------+
|employee_name|department|salary|ntile|lower_bound| range|
+-------------+----------+------+-----+-----------+------------------+
| James| Sales| 3000| 1| 3000| [3000]|
| Michael| Sales| 3100| 1| 3000| [3000]|
| Robert| Sales| 3200| 1| 3000| [3000]|
| Maria| Finance| 3300| 1| 3000| [3000]|
| James| Sales| 3400| 2| 3400| [3000, 3400]|
| Scott| Finance| 3500| 2| 3400| [3000, 3400]|
| Jen| Finance| 3600| 2| 3400| [3000, 3400]|
| Jeff| Marketing| 3700| 3| 3700|[3000, 3700, 3400]|
| Kumar| Marketing| 3800| 3| 3700|[3000, 3700, 3400]|
| Saif| Sales| 3900| 3| 3700|[3000, 3700, 3400]|
+-------------+----------+------+-----+-----------+------------------+
I am expecting the dataset to look like this我希望数据集看起来像这样
+-------------+----------+------+-----+-----------+------------------+
|employee_name|department|salary|ntile|lower_bound| range|
+-------------+----------+------+-----+-----------+------------------+
| James| Sales| 3000| 1| 3000|[3000, 3700, 3400]|
| Michael| Sales| 3100| 1| 3000|[3000, 3700, 3400]|
| Robert| Sales| 3200| 1| 3000|[3000, 3700, 3400]|
| Maria| Finance| 3300| 1| 3000|[3000, 3700, 3400]|
| James| Sales| 3400| 2| 3400|[3000, 3700, 3400]|
| Scott| Finance| 3500| 2| 3400|[3000, 3700, 3400]|
| Jen| Finance| 3600| 2| 3400|[3000, 3700, 3400]|
| Jeff| Marketing| 3700| 3| 3700|[3000, 3700, 3400]|
| Kumar| Marketing| 3800| 3| 3700|[3000, 3700, 3400]|
| Saif| Sales| 3900| 3| 3700|[3000, 3700, 3400]|
+-------------+----------+------+-----+-----------+------------------+
To ensure that your windows take into account all rows and not only rows before current row, you can use rowsBetween
method with Window.unboundedPreceding
and Window.unboundedFollowing
as argument.为确保您的 windows 考虑所有行,而不仅仅是当前行之前的行,您可以使用
rowsBetween
方法与Window.unboundedPreceding
和Window.unboundedFollowing
作为参数。 Your last line thus become:您的最后一行因此变为:
var rangeDf = ntileMinDf.withColumn(
"range",
collect_set("lower_bound")
.over(Window.rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing))
)
and you get the following rangeDf
dataframe:你得到以下
rangeDf
dataframe:
+-------------+----------+------+-----+-----------+------------------+
|employee_name|department|salary|ntile|lower_bound| range|
+-------------+----------+------+-----+-----------+------------------+
| James| Sales| 3000| 1| 3000|[3000, 3700, 3400]|
| Michael| Sales| 3100| 1| 3000|[3000, 3700, 3400]|
| Robert| Sales| 3200| 1| 3000|[3000, 3700, 3400]|
| Maria| Finance| 3300| 1| 3000|[3000, 3700, 3400]|
| James| Sales| 3400| 2| 3400|[3000, 3700, 3400]|
| Scott| Finance| 3500| 2| 3400|[3000, 3700, 3400]|
| Jen| Finance| 3600| 2| 3400|[3000, 3700, 3400]|
| Jeff| Marketing| 3700| 3| 3700|[3000, 3700, 3400]|
| Kumar| Marketing| 3800| 3| 3700|[3000, 3700, 3400]|
| Saif| Sales| 3900| 3| 3700|[3000, 3700, 3400]|
+-------------+----------+------+-----+-----------+------------------+
声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.