
Convert this SQL left-join query to Spark DataFrames (Scala)

I have this SQL query, which is a left join whose SELECT list at the top also picks columns from the right table. Can you please help me convert it to Spark DataFrames and get the result using spark-shell? I don't want to use SQL code in Spark; I want to use DataFrames instead.

I know the join syntax in Scala, but I don't know how to select from the right table (here it is count(w.id2)), since the left join doesn't seem to give me access to the right table's columns.

Thank you!

select count(x.user_id) user_id_count, count(w.id2) current_id2_count
from
    (select
        user_id
    from
        tb1
    where
        year='2021'
        and month=1
        
    ) x
left join
    (select id1, max(id2) id2 from tb2 group by id1) w
on
    x.user_id=w.id1;

In Spark I would create two DataFrames x and w and join them:

var x = spark.sqlContext.table("tb1").where("year='2021' and month=1")
var w = spark.sqlContext.table("tb2").groupBy("id1").agg(max("id2").alias("id2"))
var joined = x.join(w, x("user_id")===w("id1"), "left")

Your request is a little hard to follow, but I'll try to answer by taking the SQL code you provided as the baseline and reproducing it with Spark.

// Needed in spark-shell for col, count, max, and the DataFrame type
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, count, max}

// Reading tb1 (x) and filtering for January 2021, selecting only "user_id"
val x: DataFrame = spark.read
  .table("tb1")
  .filter(col("year") === "2021")
  .filter(col("month") === 1)
  .select("user_id")

// Reading tb2 (w) and, for each "id1", taking the max "id2".
// Using agg(max(...).as("id2")) keeps the column named "id2";
// the shorthand .max("id2") would name it "max(id2)" instead.
val w: DataFrame = spark.read
  .table("tb2")
  .groupBy(col("id1"))
  .agg(max("id2").as("id2"))

// Joining x and w on "user_id" === "id1", then counting user_id and id2.
// count() skips nulls, so left rows with no match do not contribute to
// current_id2_count, matching the SQL COUNT(w.id2) semantics.
val xJoinsW: DataFrame = x
  .join(w, x("user_id") === w("id1"), "left")
  .select(
    count(col("user_id")).as("user_id_count"),
    count(col("id2")).as("current_id2_count")
  )
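To actually get the result in spark-shell you still need to trigger an action; a minimal example, reusing the names from the code above:

// Print the single result row
xJoinsW.show()

// Or pull the counts back to the driver as plain Longs
val resultRow = xJoinsW.first()
val userIdCount = resultRow.getLong(0)
val currentId2Count = resultRow.getLong(1)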

A small but relevant remark: since you're using Scala and Spark, I would suggest using val rather than var. val means the reference is final and cannot be reassigned, whereas var can be reassigned later.
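For illustration, a minimal sketch of the difference:

val a = 1
// a = 2   // does not compile: "reassignment to val"
var b = 1
b = 2      // fine: a var can be reassigned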

Lastly, feel free to swap the Spark reading mechanism for whatever you prefer.
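For example, if tb1 were stored as Parquet files instead of a catalog table, the same pipeline could start from a path (the path below is hypothetical):

// Hypothetical Parquet location standing in for the catalog table "tb1"
val xFromParquet = spark.read
  .parquet("/warehouse/tb1")
  .filter(col("year") === "2021" && col("month") === 1)
  .select("user_id")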
