
Compare Value of Current and Previous Row in Spark

I am trying to compare the record of the current and previous row in the DataFrame below. I want to calculate the AMOUNT column.

scala> val dataset = sc.parallelize(Seq((1, 123, 50), (2, 456, 30), (3, 456, 70), (4, 789, 80))).toDF("SL_NO","ID","AMOUNT")

scala> dataset.show
+-----+---+------+
|SL_NO| ID|AMOUNT|
+-----+---+------+
|    1|123|    50|
|    2|456|    30|
|    3|456|    70|
|    4|789|    80|
+-----+---+------+

Calculation Logic:

  1. For row no 1, AMOUNT should be 50, taken from the first row itself.
  2. For row no 2, if the ID of SL_NO 2 and SL_NO 1 is not the same, take the AMOUNT of SL_NO 2 (i.e. 30); otherwise take the AMOUNT of SL_NO 1 (i.e. 50).
  3. For row no 3, if the ID of SL_NO 3 and SL_NO 2 is not the same, take the AMOUNT of SL_NO 3 (i.e. 70); otherwise take the AMOUNT of SL_NO 2 (i.e. 30).

The same logic needs to be applied to the remaining rows.

Expected Output:

+-----+---+------+
|SL_NO| ID|AMOUNT|
+-----+---+------+
|    1|123|    50|
|    2|456|    30|
|    3|456|    30|
|    4|789|    80|
+-----+---+------+

Please help.

You could use lag with when.otherwise; here is a demonstration. Note that for the first row lag returns null, so the when condition is not satisfied and the row keeps its own AMOUNT:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{lag, when}

// Order all rows globally by SL_NO (no partitioning).
val w = Window.orderBy($"SL_NO")

dataset.withColumn("AMOUNT",
    // If the previous row has the same ID, take its AMOUNT; otherwise keep this row's AMOUNT.
    when($"ID" === lag($"ID", 1).over(w), lag($"AMOUNT", 1).over(w)).otherwise($"AMOUNT")
).show

+-----+---+------+
|SL_NO| ID|AMOUNT|
+-----+---+------+
|    1|123|    50|
|    2|456|    30|
|    3|456|    30|
|    4|789|    80|
+-----+---+------+

Note: since this example does not use any partitioning, all rows are pulled into a single partition to evaluate the window, which can be a performance problem on real data. It would help if your problem can be partitioned by some variable, perhaps Window.partitionBy($"ID").orderBy($"SL_NO"), depending on your actual problem and on whether rows with the same ID are sorted together; see the sketch below.
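
As a minimal sketch of that partitioned variant, assuming rows sharing an ID are contiguous when sorted by SL_NO (if they are not, the result can differ from the unpartitioned version): within an ID partition the previous row's ID always matches, so the when/otherwise condition collapses into a coalesce.

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{coalesce, lag}

// Assumption: same-ID rows are contiguous in SL_NO order, so partitioning
// by ID preserves the original "previous row" semantics while letting
// Spark process each ID group independently.
val wp = Window.partitionBy($"ID").orderBy($"SL_NO")

dataset.withColumn("AMOUNT",
    // Previous AMOUNT within the same ID if one exists, else the row's own.
    coalesce(lag($"AMOUNT", 1).over(wp), $"AMOUNT")
).show

On the sample data this produces the same output as above.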
