Add a column to a DataFrame with an aggregated value

Question

I have a DataFrame as follows:

profDF
+---+------------+---------+------+
| ID|        Name|      Occ|Salary|
+---+------------+---------+------+
|  1|       James|Detective| 30000|
|  2|      Victor| Salesman| 50000|
|  3|       Doris|      CEO| 20000|
+---+------------+---------+------+

I want to add a new column that contains the difference between the maximum Salary and the Salary of each person:

+---+------------+---------+------+-------+
| ID|        Name|      Occ|Salary|DiffMax|
+---+------------+---------+------+-------+
|  1|       James|Detective| 30000|  20000|
|  2|      Victor| Salesman| 50000|      0|
|  3|       Doris|      CEO| 20000|  30000|
+---+------------+---------+------+-------+

One way to do this would be to create another DataFrame by doing a groupBy("ID") and max, and then join that DataFrame with profDF on "ID". But groupBy("ID") computes the max within each ID group, so it would not give me the max Salary across all the rows.

Another way would be to use withColumn("DiffMax", ...), but I can't work out what the second parameter to withColumn should be to get the desired result.

Could someone help me with this? I'm using Spark 1.6.0.

Answer 1

This is one way of doing it: find the max Salary first, then use withColumn to compute the difference between that max and each row's Salary.

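import org.apache.spark.sql.functions.{lit, max}

// agg without groupBy aggregates over all rows, yielding the single max Salary
val maxSalary = profDF.agg(max(profDF("Salary"))).first().get(0)

profDF.withColumn("DiffMax", lit(maxSalary) - profDF("Salary")).show()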

//output

+---+------+---------+------+-------+
| ID|  Name|      Occ|Salary|DiffMax|
+---+------+---------+------+-------+
|  1| James|Detective| 30000|20000.0|
|  2|Victor| Salesman| 50000|    0.0|
|  3| Doris|      CEO| 20000|30000.0|
+---+------+---------+------+-------+
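The join-based idea from the question can also be made to work if the aggregate is taken over the whole DataFrame instead of per ID. The following is only a sketch under some assumptions: Salary is an integer column, the object name DiffMaxJoin, the helper addDiffMax, and the temporary column MaxSalary are made up for illustration, and the condition-less join is the Cartesian join available in Spark 1.6 (later versions provide an explicit crossJoin and may reject an implicit Cartesian product depending on configuration).

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.{DataFrame, SQLContext}
import org.apache.spark.sql.functions.{col, max}

object DiffMaxJoin {
  // Hypothetical helper: adds DiffMax = (global max Salary) - Salary.
  def addDiffMax(df: DataFrame): DataFrame = {
    // agg without groupBy aggregates over all rows, giving a one-row DataFrame
    val maxDF = df.agg(max("Salary").as("MaxSalary"))

    // A join with no condition is a Cartesian join; since maxDF has exactly
    // one row, every row of df is paired with the global max.
    df.join(maxDF)
      .withColumn("DiffMax", col("MaxSalary") - col("Salary"))
      .drop("MaxSalary")
  }

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("DiffMaxJoin").setMaster("local[*]"))
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    // Sample data copied from the question; Salary assumed to be Int
    val profDF = Seq(
      (1, "James", "Detective", 30000),
      (2, "Victor", "Salesman", 50000),
      (3, "Doris", "CEO", 20000)
    ).toDF("ID", "Name", "Occ", "Salary")

    addDiffMax(profDF).show()
    sc.stop()
  }
}
```

Compared with collecting the max to the driver and passing it through lit, this keeps everything as a single DataFrame expression. For one scalar value, though, the approach in the answer above is probably simpler.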

Solution 1 | 4 | accepted | 2018-02-02 06:30:09
