
Using Spark SQL joinWith, how can I join two datasets to match current records with their previous records based on date?

I am trying to join two datasets of meter readings with Spark SQL's joinWith, so that the returned type is Dataset[(Reading, Reading)]. The goal is to match each row in the first dataset (called Current) with its previous record in the second dataset (called Previous), based on a date column.

I need to join on the meter key first, and then compare dates, finding the largest date that is still smaller than the current reading's date (i.e. the previous reading).

Here is what I have tried, but I suspect it is too simplistic. I am also getting a "cannot resolve" error on MAX.

val joined = Current.joinWith(
  Previous,
  (Current("Meter_Key") === Previous("Meter_Key"))
    && (Current("Reading_Dt_Key") > MAX(Previous("Reading_Dt_Key")))
)

Can anyone help?

I did not try LAG, although I think that would also work. Instead, I looked at your requirement for a joinWith and applied some extra logic for performance reasons; many steps in the jobs are skipped this way. I used different names, but you can abstract, rename, and drop columns as needed.

import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.Window
import spark.implicits._

case class mtr0(mtr: String, seqNum: Int)
case class mtr(mtr: String, seqNum: Int, rank: Int)

// Generate data and optimize for the JOIN: only the top 2 records per ranked set matter.
val curr0 = Seq(
  mtr0("m1", 1),
  mtr0("m1", 2),
  mtr0("m1", 3),
  mtr0("m2", 7)
).toDS

// Rank readings per meter, newest first.
val curr1 = curr0.withColumn("rank", row_number()
                 .over(Window.partitionBy($"mtr").orderBy($"seqNum".desc)))

// Reduce the datasets before the JOIN.
val currF = curr1.filter($"rank" === 1).as[mtr]
//currF.show(false)
val prevF = curr1.filter($"rank" === 2).as[mtr]
//prevF.show(false)

val selfDF = currF.as("curr").joinWith(prevF.as("prev"),
  col("curr.mtr") === col("prev.mtr") && (col("curr.rank") === 1) && (col("prev.rank") === 2), "left")

// A null value appears when a meter has only 1 entry.
selfDF.show(false)

returns:

+----------+----------+
|_1        |_2        |
+----------+----------+
|[m1, 3, 1]|[m1, 2, 2]|
|[m2, 7, 1]|null      |
+----------+----------+

selfDF: org.apache.spark.sql.Dataset[(mtr, mtr)] = [_1: struct<mtr: string, seqNum: int ... 1 more field>, _2: struct<mtr: string, seqNum: int ... 1 more field>]
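For completeness, the LAG approach mentioned at the top can be sketched as below. This is a hedged sketch, not part of the original answer: it assumes a running SparkSession named spark (as in spark-shell) and a hypothetical Reading case class with the question's Meter_Key and Reading_Dt_Key columns plus an illustrative value column. It pairs each reading with its previous reading per meter using a window function, avoiding the join entirely:

```scala
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.Window
import spark.implicits._

case class Reading(Meter_Key: String, Reading_Dt_Key: Int, value: Double)

val readings = Seq(
  Reading("m1", 20200101, 10.0),
  Reading("m1", 20200201, 12.5),
  Reading("m1", 20200301, 15.0),
  Reading("m2", 20200101, 7.0)
).toDS

// lag(col, 1) over a per-meter window ordered by date yields the previous reading's columns.
val w = Window.partitionBy($"Meter_Key").orderBy($"Reading_Dt_Key")

val withPrev = readings
  .withColumn("prev_dt", lag($"Reading_Dt_Key", 1).over(w))
  .withColumn("prev_value", lag($"value", 1).over(w))

// The first reading of each meter gets nulls, mirroring the null row for m2 in the left joinWith above.
withPrev.show(false)
```

Unlike the joinWith version, this keeps all readings (not just the latest per meter) and returns a flat DataFrame rather than Dataset[(Reading, Reading)], so it suits the general "match every reading with its predecessor by date" case.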
