
Join two Spark DataFrames on a nested column and update one of the columns

I am working on a requirement in which I receive a small table in the form of a CSV file, with the following schema:

root
 |-- ACCT_NO: string (nullable = true)
 |-- SUBID: integer (nullable = true)
 |-- MCODE: string (nullable = true)
 |-- NewClosedDate: timestamp (nullable = true)

We also have a very big external Hive table, stored as Avro files in HDFS, with the following schema:

root
 |-- accountlinks: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- account: struct (nullable = true)
 |    |    |    |-- acctno: string (nullable = true)
 |    |    |    |-- subid: string (nullable = true)
 |    |    |    |-- mcode: string (nullable = true)
 |    |    |    |-- openeddate: string (nullable = true)
 |    |    |    |-- closeddate: string (nullable = true)

Now, the requirement is to look up the external Hive table based on three columns from the CSV file: ACCT_NO, SUBID, and MCODE. If a match is found, update accountlinks.account.closeddate with NewClosedDate from the CSV file.

I have already written the following code to explode the required columns and join them with the small table, but I am not sure how to update the closeddate field (currently null for all account holders) with NewClosedDate: because closeddate is a nested column, I cannot simply use withColumn to populate it. In addition, the schema and column names cannot be changed, as these files are linked to an external Hive table.

val df = spark.sql("select * from db.table where archive='201711'")

val ExtractedColumn = df
  .coalesce(150)
  .withColumn("ACCT_NO", explode($"accountlinks.account.acctno"))
  .withColumn("SUBID", explode($"accountlinks.account.subid"))
  .withColumn("MCODE", explode($"accountlinks.account.mcode"))

val ReferenceData = spark.read.format("csv")
  .option("header", "true")
  .option("inferSchema", "true")
  .load("file.csv")

val FinalData = ExtractedColumn.join(ReferenceData, Seq("ACCT_NO","SUBID","MCODE") , "left")

All you need is to explode the accountlinks array and then join the two DataFrames like this:

val explodedDF = df.withColumn("account", explode($"accountlinks"))
val joinCondition = $"ACCT_NO" === $"account.acctno" && $"SUBID" === $"account.subid" && $"MCODE" === $"account.mcode"
val joinDF  = explodedDF.join(ReferenceData, joinCondition, "left")

Now you can update the account struct column as shown below, and use collect_list to rebuild the array structure:

val FinalData = joinDF
  .withColumn("account",
    struct($"account.acctno", $"account.subid", $"account.mcode",
           $"account.openeddate", $"NewClosedDate".alias("closeddate")))
  // groupBy() with no keys collects every account into a single array;
  // group by a row-identifying key instead if each source row should keep its own array
  .groupBy().agg(collect_list($"account").alias("accountlinks"))

The idea is to create a new struct with all the fields from account except closeddate, which you take from the NewClosedDate column instead.

If the struct contains many fields, you can iterate over the struct's field names and keep everything except closeddate, instead of typing them all out.
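A minimal sketch of that approach, assuming the joinDF from above and `org.apache.spark.sql.functions` in scope: read the field names of the account struct from the DataFrame itself, drop closeddate, and append the renamed NewClosedDate column.

```scala
import org.apache.spark.sql.functions.{col, struct}

// All field names of the `account` struct, taken from the DataFrame schema
val accountFields: Array[String] = joinDF.select($"account.*").columns

// Keep every field except closeddate, then append NewClosedDate under that name
val kept = accountFields.filter(_ != "closeddate").map(f => col(s"account.$f"))
val updatedDF = joinDF.withColumn(
  "account",
  struct(kept :+ col("NewClosedDate").alias("closeddate"): _*)
)
```

This way the code keeps working if fields are later added to the account struct, as long as the closeddate name stays the same.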

