Add Column to DataFrame With Aggregated Values
I have a DataFrame as follows:
profDF
+---+------------+---------+------+
| ID|        Name|      Occ|Salary|
+---+------------+---------+------+
|  1|       James|Detective| 30000|
|  2|      Victor| Salesman| 50000|
|  3|       Doris|      CEO| 20000|
+---+------------+---------+------+
I want to add a new column that contains the difference between the maximum Salary and the Salary of each person:
+---+------------+---------+------+-------+
| ID|        Name|      Occ|Salary|DiffMax|
+---+------------+---------+------+-------+
|  1|       James|Detective| 30000|  20000|
|  2|      Victor| Salesman| 50000|      0|
|  3|       Doris|      CEO| 20000|  30000|
+---+------------+---------+------+-------+
One way to do this would be to create another DataFrame by doing a groupBy("ID") and max, and then join that DataFrame with profDF on "ID". But groupBy("ID") computes the max within each ID, so it would not give me the max Salary across all the rows.
Another way would be to use withColumn("DiffMax", ...), but I can't work out what the second parameter to withColumn should be to get the desired result.
Could someone help me with this? I'm using Spark 1.6.0.
This is one way of doing it. First find the max Salary, then use withColumn to compute the difference between that max Salary and each row's existing Salary.
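import org.apache.spark.sql.functions.{lit, max}
// agg without groupBy aggregates over all rows; first() returns the single result Row
val maxSalary = profDF.agg(max(profDF("Salary"))).first().get(0)
profDF.withColumn("DiffMax", lit(maxSalary) - profDF("Salary")).show()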
//output
+---+------+---------+------+-------+
| ID| Name| Occ|Salary|DiffMax|
+---+------+---------+------+-------+
| 1| James|Detective| 30000|20000.0|
| 2|Victor| Salesman| 50000| 0.0|
| 3| Doris| CEO| 20000|30000.0|
+---+------+---------+------+-------+
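For anyone who wants to try the approach above end to end, here is a minimal self-contained sketch. It makes several assumptions that are not in the original post: a local master, a Spark 1.6-style SQLContext, an object name (DiffMaxSketch) chosen purely for illustration, and a Salary column stored as an integer. The sample rows are copied from the question.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.functions.{col, lit, max}

// Hypothetical wrapper object, for illustration only.
object DiffMaxSketch {
  def main(args: Array[String]): Unit = {
    // Assumption: run locally; adjust the master for a real cluster.
    val conf = new SparkConf().setAppName("DiffMaxSketch").setMaster("local[*]")
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    // Rebuild the question's sample data; Salary is assumed to be an Int.
    val profDF = Seq(
      (1, "James", "Detective", 30000),
      (2, "Victor", "Salesman", 50000),
      (3, "Doris", "CEO", 20000)
    ).toDF("ID", "Name", "Occ", "Salary")

    // agg without a preceding groupBy aggregates over the whole DataFrame,
    // so the result is a single Row holding the global maximum.
    val maxSalary: Int = profDF.agg(max(col("Salary"))).first().getInt(0)

    // lit wraps the driver-side value as a constant column so it can be
    // subtracted from Salary row by row.
    val withDiff = profDF.withColumn("DiffMax", lit(maxSalary) - col("Salary"))
    withDiff.show()

    sc.stop()
  }
}
```

Reading the maximum with getInt(0) rather than get(0) keeps the value typed, which only works if Salary really is an integer column. The 20000.0 style values in the output above suggest the original Salary column had a different type, in which case a different getter (or get(0), as in the answer) would be needed.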