
Using Spark SQL joinWith, how can I join two datasets to match current records with their previous records based on date?

I am trying to join two datasets of meter readings with Spark SQL's joinWith, so that the returned type is Dataset[(Reading, Reading)]. The goal is to match each row in the first dataset (called Current) with its previous record in the second dataset (called Previous), based on a date column.

I need to join on the meter key first, and then compare dates, finding the largest date that is still smaller than the current reading's date (i.e. the previous reading).

Here is what I have tried, but I suspect it is too simplistic. I am also getting a "cannot resolve" error on MAX.

val joined = Current.joinWith(
  Previous,
  (Current("Meter_Key") === Previous("Meter_Key"))
    && (Current("Reading_Dt_Key") > MAX(Previous("Reading_Dt_Key")))
)

Can anyone help?

I did not try LAG, although I think that would also work. Instead, I looked at your requirement for a joinWith and applied some extra logic for performance reasons; many steps in the jobs are skipped this way. I used different names, but you can abstract, rename, and drop columns as needed.

import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.Window
import spark.implicits._

case class mtr0(mtr: String, seqNum: Int)
case class mtr(mtr: String, seqNum: Int, rank: Int)

// Generate data and optimize for the JOIN: only the top 2 records per ranked set matter.
val curr0 = Seq(
  mtr0("m1", 1),
  mtr0("m1", 2),
  mtr0("m1", 3),
  mtr0("m2", 7)
).toDS

// Rank readings per meter, newest first.
val curr1 = curr0.withColumn("rank", row_number()
                 .over(Window.partitionBy($"mtr").orderBy($"seqNum".desc)))

// Reduce the datasets before the JOIN.
val currF = curr1.filter($"rank" === 1).as[mtr]
//currF.show(false)
val prevF = curr1.filter($"rank" === 2).as[mtr]
//prevF.show(false)

val selfDF = currF.as("curr").joinWith(prevF.as("prev"),
  col("curr.mtr") === col("prev.mtr") && (col("curr.rank") === 1) && (col("prev.rank") === 2), "left")

// A null value appears when a meter has only 1 entry.
selfDF.show(false)

returns:

+----------+----------+
|_1        |_2        |
+----------+----------+
|[m1, 3, 1]|[m1, 2, 2]|
|[m2, 7, 1]|null      |
+----------+----------+

selfDF: org.apache.spark.sql.Dataset[(mtr, mtr)] = [_1: struct<mtr: string, seqNum: int ... 1 more field>, _2: struct<mtr: string, seqNum: int ... 1 more field>]
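For completeness, the LAG approach mentioned at the top can be sketched as below. This is a hedged sketch, not part of the original answer: it assumes a running SparkSession named spark (as in spark-shell) and a hypothetical Reading case class with the question's Meter_Key and Reading_Dt_Key columns plus an illustrative value column. It pairs each reading with its previous reading per meter using a window function, avoiding the join entirely:

```scala
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.Window
import spark.implicits._

case class Reading(Meter_Key: String, Reading_Dt_Key: Int, value: Double)

val readings = Seq(
  Reading("m1", 20200101, 10.0),
  Reading("m1", 20200201, 12.5),
  Reading("m1", 20200301, 15.0),
  Reading("m2", 20200101, 7.0)
).toDS

// lag(col, 1) over a per-meter window ordered by date yields the previous reading's columns.
val w = Window.partitionBy($"Meter_Key").orderBy($"Reading_Dt_Key")

val withPrev = readings
  .withColumn("prev_dt", lag($"Reading_Dt_Key", 1).over(w))
  .withColumn("prev_value", lag($"value", 1).over(w))

// The first reading of each meter gets nulls, mirroring the null row for m2 in the left joinWith above.
withPrev.show(false)
```

Unlike the joinWith version, this keeps all readings (not just the latest per meter) and returns a flat DataFrame rather than Dataset[(Reading, Reading)], so it suits the general "match every reading with its predecessor by date" case.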
