Add Column to DataFrame With Aggregated Values

I have a DataFrame as follows:

profDF
+---+------------+---------+------+
| ID|        Name|      Occ|Salary|
+---+------------+---------+------+
|  1|       James|Detective| 30000|
|  2|      Victor| Salesman| 50000|
|  3|       Doris|      CEO| 20000|
+---+------------+---------+------+

I want to add a new column that contains the difference between the maximum Salary and each person's Salary.

+---+------------+---------+------+-------+
| ID|        Name|      Occ|Salary|DiffMax|
+---+------------+---------+------+-------+
|  1|       James|Detective| 30000|  20000|
|  2|      Victor| Salesman| 50000|      0|
|  3|       Doris|      CEO| 20000|  30000|
+---+------------+---------+------+-------+

One way to do this would be to create another DataFrame with groupBy("ID") and max, then join it with profDF on "ID". But groupBy("ID") would give me the maximum per ID, not the maximum Salary across all rows.

Another way would be to use withColumn("DiffMax", ...), but I can't figure out what to pass as the second argument to withColumn to get the desired result.

Could someone help me with this? I'm using Spark 1.6.0.

This is one way of doing it: find the maximum Salary first, then use withColumn to compute the difference between that maximum and each row's Salary.

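import org.apache.spark.sql.functions.{lit, max}
// Global maximum Salary, collected to the driver as a single value
val maxSalary = profDF.agg(max(profDF("Salary"))).first().get(0)

profDF.withColumn("DiffMax", lit(maxSalary) - profDF("Salary")).show()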

//output

+---+------+---------+------+-------+
| ID|  Name|      Occ|Salary|DiffMax|
+---+------+---------+------+-------+
|  1| James|Detective| 30000|20000.0|
|  2|Victor| Salesman| 50000|    0.0|
|  3| Doris|      CEO| 20000|30000.0|
+---+------+---------+------+-------+
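
For reference, the same approach can be written as a standalone program. This is only a sketch under a few assumptions not stated in the original: the Salary column is IntegerType, the DataFrame is non-empty, and the sample rows are built inline on a local SQLContext. The names DiffMaxExample, withDiffMax and sampleData are made up for illustration. It reads the maximum with the typed getter getInt(0) rather than get(0).

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.{DataFrame, SQLContext}
import org.apache.spark.sql.functions.{lit, max}

object DiffMaxExample {

  // Adds DiffMax = (global max Salary) - Salary.
  // Assumption: Salary is IntegerType and df has at least one row.
  def withDiffMax(df: DataFrame): DataFrame = {
    // agg without a preceding groupBy aggregates over the whole DataFrame
    val maxSalary = df.agg(max(df("Salary"))).first().getInt(0)
    // lit wraps the driver-side value as a constant column
    df.withColumn("DiffMax", lit(maxSalary) - df("Salary"))
  }

  // Rebuilds the sample rows from the question (illustrative only).
  def sampleData(sqlContext: SQLContext): DataFrame = {
    import sqlContext.implicits._
    Seq(
      (1, "James", "Detective", 30000),
      (2, "Victor", "Salesman", 50000),
      (3, "Doris", "CEO", 20000)
    ).toDF("ID", "Name", "Occ", "Salary")
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("DiffMaxExample").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try {
      val sqlContext = new SQLContext(sc)
      withDiffMax(sampleData(sqlContext)).show()
    } finally {
      sc.stop()
    }
  }
}
```

With an integer Salary column the subtraction stays in integers, so DiffMax comes out as 20000, 0 and 30000, matching the table in the question. This version runs two Spark jobs (one for the aggregate, one for the final result), which is usually fine because the aggregate returns a single row.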
