Compare Value of Current and Previous Row in Spark
I am trying to compare the record of the current and previous row in the DataFrame below. I want to calculate the AMOUNT column.
scala> val dataset = sc.parallelize(Seq((1, 123, 50), (2, 456, 30), (3, 456, 70), (4, 789, 80))).toDF("SL_NO","ID","AMOUNT")
scala> dataset.show
+-----+---+------+
|SL_NO| ID|AMOUNT|
+-----+---+------+
| 1|123| 50|
| 2|456| 30|
| 3|456| 70|
| 4|789| 80|
+-----+---+------+
Calculation Logic: if the current row's ID matches the previous row's ID, the AMOUNT should be carried over from the previous row. The same logic needs to be followed for the other rows as well.
Expected Output:
+-----+---+------+
|SL_NO| ID|AMOUNT|
+-----+---+------+
| 1|123| 50|
| 2|456| 30|
| 3|456| 30|
| 4|789| 80|
+-----+---+------+
Please help.
You could use lag with when.otherwise; here is a demonstration:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{lag, when}

val w = Window.orderBy($"SL_NO")
dataset.withColumn("AMOUNT",
  when($"ID" === lag($"ID", 1).over(w), lag($"AMOUNT", 1).over(w)).otherwise($"AMOUNT")
).show
+-----+---+------+
|SL_NO| ID|AMOUNT|
+-----+---+------+
| 1|123| 50|
| 2|456| 30|
| 3|456| 30|
| 4|789| 80|
+-----+---+------+
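Worth noting: lag reads the original column values, not the rewritten ones, so the carry-forward never chains across more than one row. The rule itself can be sketched in plain Scala without Spark (the Rec case class here is just a hypothetical stand-in for a DataFrame row):

```scala
// Plain-Scala sketch of the lag + when.otherwise rule (no Spark required).
// Rec is a hypothetical case class standing in for a DataFrame row.
case class Rec(slNo: Int, id: Int, amount: Int)

val rows = Seq(Rec(1, 123, 50), Rec(2, 456, 30), Rec(3, 456, 70), Rec(4, 789, 80))

// Pair each row with its predecessor (the "lag"), then apply the rule:
// if the previous row has the same ID, take its original AMOUNT; otherwise keep ours.
val result = rows.zip(None +: rows.map(Some(_))).map {
  case (cur, Some(prev)) if prev.id == cur.id => cur.copy(amount = prev.amount)
  case (cur, _)                               => cur
}
// result amounts: 50, 30, 30, 80 — matching the expected output above.
```

Because the carry-forward uses the predecessor's original amount, three consecutive rows with the same ID would each copy from their immediate neighbour's original value, exactly as lag over a window does.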
Note: since this example doesn't use any partition, it could have performance problems. On your real data, it would help if the problem can be partitioned by some variables, perhaps Window.partitionBy($"ID").orderBy($"SL_NO"), depending on your actual problem and whether rows with the same ID are sorted together.