Compare Value of Current and Previous Row in Spark
I am trying to compare the record of the current and previous row in the DataFrame below. I want to calculate the AMOUNT column.
scala> val dataset = sc.parallelize(Seq((1, 123, 50), (2, 456, 30), (3, 456, 70), (4, 789, 80))).toDF("SL_NO","ID","AMOUNT")
scala> dataset.show
+-----+---+------+
|SL_NO| ID|AMOUNT|
+-----+---+------+
| 1|123| 50|
| 2|456| 30|
| 3|456| 70|
| 4|789| 80|
+-----+---+------+
Calculation Logic: if the current row's ID matches the previous row's ID, the AMOUNT should be carried over from the previous row. The same logic needs to be followed for the other rows as well.
Expected Output:
+-----+---+------+
|SL_NO| ID|AMOUNT|
+-----+---+------+
| 1|123| 50|
| 2|456| 30|
| 3|456| 30|
| 4|789| 80|
+-----+---+------+
Please help.
You could use lag with when.otherwise; here is a demonstration:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{lag, when}

val w = Window.orderBy($"SL_NO")
dataset.withColumn("AMOUNT",
  when($"ID" === lag($"ID", 1).over(w), lag($"AMOUNT", 1).over(w)).otherwise($"AMOUNT")
).show
+-----+---+------+
|SL_NO| ID|AMOUNT|
+-----+---+------+
| 1|123| 50|
| 2|456| 30|
| 3|456| 30|
| 4|789| 80|
+-----+---+------+
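Worth noting: lag reads the original column values, not the rewritten ones, so the carry-forward never chains across more than one row. The rule itself can be sketched in plain Scala without Spark (the Rec case class here is just a hypothetical stand-in for a DataFrame row):

```scala
// Plain-Scala sketch of the lag + when.otherwise rule (no Spark required).
// Rec is a hypothetical case class standing in for a DataFrame row.
case class Rec(slNo: Int, id: Int, amount: Int)

val rows = Seq(Rec(1, 123, 50), Rec(2, 456, 30), Rec(3, 456, 70), Rec(4, 789, 80))

// Pair each row with its predecessor (the "lag"), then apply the rule:
// if the previous row has the same ID, take its original AMOUNT; otherwise keep ours.
val result = rows.zip(None +: rows.map(Some(_))).map {
  case (cur, Some(prev)) if prev.id == cur.id => cur.copy(amount = prev.amount)
  case (cur, _)                               => cur
}
// result amounts: 50, 30, 30, 80 — matching the expected output above.
```

Because the carry-forward uses the predecessor's original amount, three consecutive rows with the same ID would each copy from their immediate neighbour's original value, exactly as lag over a window does.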
Note: since this example doesn't use any partition, it could have performance problems. On your real data, it would help if the problem can be partitioned by some variables, perhaps Window.partitionBy($"ID").orderBy($"SL_NO"), depending on your actual problem and whether rows with the same ID are sorted together.