
Compare Value of Current and Previous Row, and after for Column if required in Spark

I am trying to select the value of one column based on the values of other rows and other columns.

scala> val df = Seq((1,"051",0,0,10,0),(1,"052",0,0,0,0),(2,"053",10,0,10,0),(2,"054",0,0,10,0),(3,"055",100,50,0,0),(3,"056",100,10,0,0),(3,"057",100,20,0,0),(4,"058",70,15,0,0),(4,"059",70,15,0,20),(4,"060",70,15,0,0)).toDF("id","code","value_1","value_2","value_3","Value_4")
scala> df.show()
+---+----+-------+-------+-------+-------+
| id|code|value_1|value_2|value_3|Value_4|
+---+----+-------+-------+-------+-------+
|  1| 051|      0|      0|     10|      0|
|  1| 052|      0|      0|      0|      0|
|  2| 053|     10|      0|     10|      0|
|  2| 054|      0|      0|     10|      0|
|  3| 055|    100|     50|      0|      0|
|  3| 056|    100|     10|      0|      0| 
|  3| 057|    100|     20|      0|      0| 
|  4| 058|     70|     15|      0|      0| 
|  4| 059|     70|     15|      0|     20| 
|  4| 060|     70|     15|      0|      0| 
+---+----+-------+-------+-------+-------+

Calculation Logic:

Select a code for an id, following these steps:

  1. For each column n (value_1, value_2, value_3, value_4), do:
  2. For the same id, look for the maximum value in the value_n column.
  3. If the maximum value is repeated, evaluate the next column.
  4. Otherwise, if the maximum value is found without repetition, take the id and the code of the row with the maximum value; the remaining columns no longer need to be evaluated.

Expected Output:

+---+----+-------+-------+-------+-------+
| id|code|value_1|value_2|value_3|Value_4|
+---+----+-------+-------+-------+-------+
|  1| 051|      0|      0|     10|      0|
|  2| 053|     10|      0|     10|      0|
|  3| 055|    100|     50|      0|      0|
|  4| 059|     70|     15|      0|     20|
+---+----+-------+-------+-------+-------+

In the case of id 3:

  • It has the codes 055, 056, and 057.
  • value_1 is 100 for all three codes, so the maximum value is 100, but it is repeated for all three codes and no single code can be selected.
  • The value_2 column has to be evaluated next; it has the values 50, 10, and 20 for each code respectively.
  • The maximum value among the three codes is 50, and it is unique.
  • So the record with id 3 and code 055 is selected (a small sketch of this selection follows below).
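A minimal plain-Scala sketch of this tie-breaking for the id 3 rows (the names rowsForId3 and selected are only illustrative), assuming a tie on one column is broken by the next column among the tied rows:

// Rows for id 3: (code, (value_1, value_2, value_3, value_4))
val rowsForId3 = Seq(
  ("055", (100, 50, 0, 0)),
  ("056", (100, 10, 0, 0)),
  ("057", (100, 20, 0, 0))
)

// The default Ordering on tuples compares field by field: value_1 first,
// then value_2 on a tie, and so on, which is the column-by-column logic above.
val selected = rowsForId3.maxBy(_._2)
println(selected._1) // prints 055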

Please help.

You can put your value_1 to value_4 columns in a struct and call the max function on it over a window partitioned by the id column. The maximum of a struct is determined field by field (value_1 first, then value_2 on a tie, and so on), which matches the column-by-column tie-breaking described above.


scala> df.show
+---+----+-------+-------+-------+-------+
| id|code|value_1|value_2|value_3|Value_4|
+---+----+-------+-------+-------+-------+
|  1| 051|      0|      0|     10|      0|
|  1| 052|      0|      0|      0|      0|
|  2| 053|     10|      0|     10|      0|
|  2| 054|      0|      0|     10|      0|
|  3| 055|    100|     50|      0|      0|
|  3| 056|    100|     10|      0|      0|
|  3| 057|    100|     20|      0|      0|
|  4| 058|     70|     15|      0|      0|
|  4| 059|     70|     15|      0|     20|
|  4| 060|     70|     15|      0|      0|
+---+----+-------+-------+-------+-------+


scala> val dfWithVals = df.withColumn("values", struct($"value_1", $"value_2", $"value_3", $"value_4"))
dfWithVals: org.apache.spark.sql.DataFrame = [id: int, code: string ... 5 more fields]

scala> dfWithVals.show
+---+----+-------+-------+-------+-------+---------------+
| id|code|value_1|value_2|value_3|Value_4|         values|
+---+----+-------+-------+-------+-------+---------------+
|  1| 051|      0|      0|     10|      0|  [0, 0, 10, 0]|
|  1| 052|      0|      0|      0|      0|   [0, 0, 0, 0]|
|  2| 053|     10|      0|     10|      0| [10, 0, 10, 0]|
|  2| 054|      0|      0|     10|      0|  [0, 0, 10, 0]|
|  3| 055|    100|     50|      0|      0|[100, 50, 0, 0]|
|  3| 056|    100|     10|      0|      0|[100, 10, 0, 0]|
|  3| 057|    100|     20|      0|      0|[100, 20, 0, 0]|
|  4| 058|     70|     15|      0|      0| [70, 15, 0, 0]|
|  4| 059|     70|     15|      0|     20|[70, 15, 0, 20]|
|  4| 060|     70|     15|      0|      0| [70, 15, 0, 0]|
+---+----+-------+-------+-------+-------+---------------+


scala> val overColumns = org.apache.spark.sql.expressions.Window.partitionBy("id")
overColumns: org.apache.spark.sql.expressions.WindowSpec = org.apache.spark.sql.expressions.WindowSpec@de0daca

scala> dfWithVals.withColumn("maxvals", max($"values").over(overColumns)).filter($"values" === $"maxvals").show
+---+----+-------+-------+-------+-------+---------------+---------------+      
| id|code|value_1|value_2|value_3|Value_4|         values|        maxvals|
+---+----+-------+-------+-------+-------+---------------+---------------+
|  1| 051|      0|      0|     10|      0|  [0, 0, 10, 0]|  [0, 0, 10, 0]|
|  3| 055|    100|     50|      0|      0|[100, 50, 0, 0]|[100, 50, 0, 0]|
|  4| 059|     70|     15|      0|     20|[70, 15, 0, 20]|[70, 15, 0, 20]|
|  2| 053|     10|      0|     10|      0| [10, 0, 10, 0]| [10, 0, 10, 0]|
+---+----+-------+-------+-------+-------+---------------+---------------+



scala> dfWithVals.withColumn("maxvals", max($"values").over(overColumns)).filter($"values" === $"maxvals").drop("values", "maxvals").show
+---+----+-------+-------+-------+-------+                                      
| id|code|value_1|value_2|value_3|Value_4|
+---+----+-------+-------+-------+-------+
|  1| 051|      0|      0|     10|      0|
|  3| 055|    100|     50|      0|      0|
|  4| 059|     70|     15|      0|     20|
|  2| 053|     10|      0|     10|      0|
+---+----+-------+-------+-------+-------+
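If a whole struct could be tied within an id, the filter above would keep more than one row per id. A possible variant, sketched here rather than taken from the answer, orders each group by the struct in descending order and keeps only the first row (it assumes the same spark-shell session, so dfWithVals and the $ syntax are available):

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

// Struct comparison is field by field, so ordering by the "values" struct
// descending follows the same value_1, value_2, ... priority as max above.
val byValuesDesc = Window.partitionBy("id").orderBy($"values".desc)

dfWithVals
  .withColumn("rn", row_number().over(byValuesDesc))
  .filter($"rn" === 1)
  .drop("values", "rn")
  .show()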

If the data is in a form where the algorithm is guaranteed to always select exactly one row per id, the following code produces the expected result:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

val w = Window.partitionBy("id")

// For every value column, compute its per-id maximum and average.
var df2 = df
val cols = Seq("value_1", "value_2", "value_3", "value_4")
for (col <- cols) {
  df2 = df2.withColumn(s"${col}_max", max(col).over(w))
    .withColumn(s"${col}_avg", avg(col).over(w))
}

// Keep a row if, for some column, the maximum is not shared by the whole
// group (max <> avg) and this row holds that maximum.
var sel = ""
for (col <- cols) {
  sel += s"(${col}_max <> ${col}_avg and ${col} = ${col}_max) or "
}
sel = sel.dropRight(4) // drop the trailing " or "

df2.filter(sel).select("id", ("code" +: cols): _*).sort("id", "code").show()
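As a variant, the same predicate can be built with the Column API instead of concatenating a SQL string; this is only a sketch that reuses df2 and cols from the code above (pred is an illustrative name):

import org.apache.spark.sql.functions.col

// One disjunct per value column: this row holds the per-id maximum and that
// maximum is not shared by the whole group (max <> avg).
val pred = cols
  .map(c => col(s"${c}_max") =!= col(s"${c}_avg") && col(c) === col(s"${c}_max"))
  .reduce(_ || _)

df2.filter(pred).select("id", ("code" +: cols): _*).sort("id", "code").show()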
