
Compare Value of Current and Previous Row, and after for Column if required in Spark

I am trying to select the value of one column based on the values of other rows and other columns.

scala> val df = Seq((1,"051",0,0,10,0),(1,"052",0,0,0,0),(2,"053",10,0,10,0),(2,"054",0,0,10,0),(3,"055",100,50,0,0),(3,"056",100,10,0,0),(3,"057",100,20,0,0),(4,"058",70,15,0,0),(4,"059",70,15,0,20),(4,"060",70,15,0,0)).toDF("id","code","value_1","value_2","value_3","Value_4")
scala> df.show()
+---+----+-------+-------+-------+-------+
| id|code|value_1|value_2|value_3|Value_4|
+---+----+-------+-------+-------+-------+
|  1| 051|      0|      0|     10|      0|
|  1| 052|      0|      0|      0|      0|
|  2| 053|     10|      0|     10|      0|
|  2| 054|      0|      0|     10|      0|
|  3| 055|    100|     50|      0|      0|
|  3| 056|    100|     10|      0|      0| 
|  3| 057|    100|     20|      0|      0| 
|  4| 058|     70|     15|      0|      0| 
|  4| 059|     70|     15|      0|     20| 
|  4| 060|     70|     15|      0|      0| 
+---+----+-------+-------+-------+-------+

Calculation Logic:

Select a code for an id, following these steps:

  1. For each column n (value_1, value_2, value_3, value_4), do:
  2. For the same id, look for the maximum value in the value_n column.
  3. If the maximum value is repeated, evaluate the next column.
  4. Otherwise, if the maximum value is found without repetition, take the id and the code of the row with the maximum value; the remaining columns no longer need to be evaluated.

Expected Output:

+---+----+-------+-------+-------+-------+
| id|code|value_1|value_2|value_3|Value_4|
+---+----+-------+-------+-------+-------+
|  1| 051|      0|      0|     10|      0|
|  2| 053|     10|      0|     10|      0|
|  3| 055|    100|     50|      0|      0|
|  4| 059|     70|     15|      0|     20|
+---+----+-------+-------+-------+-------+

In the case of id 3:

  • It has the codes 055, 056, and 057.
  • value_1 is 100 for all three codes, so the maximum value is 100, but it is repeated for all three codes and no single code can be selected.
  • The value_2 column has to be evaluated next; it has the values 50, 10, and 20 for each code respectively.
  • The maximum value among the three codes is 50, and it is unique.
  • So the record with id 3 and code 055 is selected (a small sketch of this selection follows below).
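A minimal plain-Scala sketch of this tie-breaking for the id 3 rows (the names rowsForId3 and selected are only illustrative), assuming a tie on one column is broken by the next column among the tied rows:

// Rows for id 3: (code, (value_1, value_2, value_3, value_4))
val rowsForId3 = Seq(
  ("055", (100, 50, 0, 0)),
  ("056", (100, 10, 0, 0)),
  ("057", (100, 20, 0, 0))
)

// The default Ordering on tuples compares field by field: value_1 first,
// then value_2 on a tie, and so on, which is the column-by-column logic above.
val selected = rowsForId3.maxBy(_._2)
println(selected._1) // prints 055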

Please help.

You can put your value_1 to value_4 columns in a struct and call the max function on it over a window partitioned by the id column. The maximum of a struct is determined field by field (value_1 first, then value_2 on a tie, and so on), which matches the column-by-column tie-breaking described above.


scala> df.show
+---+----+-------+-------+-------+-------+
| id|code|value_1|value_2|value_3|Value_4|
+---+----+-------+-------+-------+-------+
|  1| 051|      0|      0|     10|      0|
|  1| 052|      0|      0|      0|      0|
|  2| 053|     10|      0|     10|      0|
|  2| 054|      0|      0|     10|      0|
|  3| 055|    100|     50|      0|      0|
|  3| 056|    100|     10|      0|      0|
|  3| 057|    100|     20|      0|      0|
|  4| 058|     70|     15|      0|      0|
|  4| 059|     70|     15|      0|     20|
|  4| 060|     70|     15|      0|      0|
+---+----+-------+-------+-------+-------+


scala> val dfWithVals = df.withColumn("values", struct($"value_1", $"value_2", $"value_3", $"value_4"))
dfWithVals: org.apache.spark.sql.DataFrame = [id: int, code: string ... 5 more fields]

scala> dfWithVals.show
+---+----+-------+-------+-------+-------+---------------+
| id|code|value_1|value_2|value_3|Value_4|         values|
+---+----+-------+-------+-------+-------+---------------+
|  1| 051|      0|      0|     10|      0|  [0, 0, 10, 0]|
|  1| 052|      0|      0|      0|      0|   [0, 0, 0, 0]|
|  2| 053|     10|      0|     10|      0| [10, 0, 10, 0]|
|  2| 054|      0|      0|     10|      0|  [0, 0, 10, 0]|
|  3| 055|    100|     50|      0|      0|[100, 50, 0, 0]|
|  3| 056|    100|     10|      0|      0|[100, 10, 0, 0]|
|  3| 057|    100|     20|      0|      0|[100, 20, 0, 0]|
|  4| 058|     70|     15|      0|      0| [70, 15, 0, 0]|
|  4| 059|     70|     15|      0|     20|[70, 15, 0, 20]|
|  4| 060|     70|     15|      0|      0| [70, 15, 0, 0]|
+---+----+-------+-------+-------+-------+---------------+


scala> val overColumns = org.apache.spark.sql.expressions.Window.partitionBy("id")
overColumns: org.apache.spark.sql.expressions.WindowSpec = org.apache.spark.sql.expressions.WindowSpec@de0daca

scala> dfWithVals.withColumn("maxvals", max($"values").over(overColumns)).filter($"values" === $"maxvals").show
+---+----+-------+-------+-------+-------+---------------+---------------+      
| id|code|value_1|value_2|value_3|Value_4|         values|        maxvals|
+---+----+-------+-------+-------+-------+---------------+---------------+
|  1| 051|      0|      0|     10|      0|  [0, 0, 10, 0]|  [0, 0, 10, 0]|
|  3| 055|    100|     50|      0|      0|[100, 50, 0, 0]|[100, 50, 0, 0]|
|  4| 059|     70|     15|      0|     20|[70, 15, 0, 20]|[70, 15, 0, 20]|
|  2| 053|     10|      0|     10|      0| [10, 0, 10, 0]| [10, 0, 10, 0]|
+---+----+-------+-------+-------+-------+---------------+---------------+



scala> dfWithVals.withColumn("maxvals", max($"values").over(overColumns)).filter($"values" === $"maxvals").drop("values", "maxvals").show
+---+----+-------+-------+-------+-------+                                      
| id|code|value_1|value_2|value_3|Value_4|
+---+----+-------+-------+-------+-------+
|  1| 051|      0|      0|     10|      0|
|  3| 055|    100|     50|      0|      0|
|  4| 059|     70|     15|      0|     20|
|  2| 053|     10|      0|     10|      0|
+---+----+-------+-------+-------+-------+
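If a whole struct could be tied within an id, the filter above would keep more than one row per id. A possible variant, sketched here rather than taken from the answer, orders each group by the struct in descending order and keeps only the first row (it assumes the same spark-shell session, so dfWithVals and the $ syntax are available):

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

// Struct comparison is field by field, so ordering by the "values" struct
// descending follows the same value_1, value_2, ... priority as max above.
val byValuesDesc = Window.partitionBy("id").orderBy($"values".desc)

dfWithVals
  .withColumn("rn", row_number().over(byValuesDesc))
  .filter($"rn" === 1)
  .drop("values", "rn")
  .show()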

If the data is in a form where the algorithm is guaranteed to always select exactly one row per id, the following code produces the expected result:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

val w = Window.partitionBy("id")

// For every value column, compute its per-id maximum and average.
var df2 = df
val cols = Seq("value_1", "value_2", "value_3", "value_4")
for (col <- cols) {
  df2 = df2.withColumn(s"${col}_max", max(col).over(w))
    .withColumn(s"${col}_avg", avg(col).over(w))
}

// Keep a row if, for some column, the maximum is not shared by the whole
// group (max <> avg) and this row holds that maximum.
var sel = ""
for (col <- cols) {
  sel += s"(${col}_max <> ${col}_avg and ${col} = ${col}_max) or "
}
sel = sel.dropRight(4) // drop the trailing " or "

df2.filter(sel).select("id", ("code" +: cols): _*).sort("id", "code").show()
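As a variant, the same predicate can be built with the Column API instead of concatenating a SQL string; this is only a sketch that reuses df2 and cols from the code above (pred is an illustrative name):

import org.apache.spark.sql.functions.col

// One disjunct per value column: this row holds the per-id maximum and that
// maximum is not shared by the whole group (max <> avg).
val pred = cols
  .map(c => col(s"${c}_max") =!= col(s"${c}_avg") && col(c) === col(s"${c}_max"))
  .reduce(_ || _)

df2.filter(pred).select("id", ("code" +: cols): _*).sort("id", "code").show()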
