I am working on a requirement in which I receive a small table in the form of a CSV file, as follows:
root
|-- ACCT_NO: string (nullable = true)
|-- SUBID: integer (nullable = true)
|-- MCODE: string (nullable = true)
|-- NewClosedDate: timestamp (nullable = true)
We also have a very large external Hive table, stored as Avro in HDFS, as follows:
root
|-- accountlinks: array (nullable = true)
| | |-- account: struct (nullable = true)
| | | |-- acctno: string (nullable = true)
| | | |-- subid: string (nullable = true)
| | | |-- mcode: string (nullable = true)
| | | |-- openeddate: string (nullable = true)
| | | |-- closeddate: string (nullable = true)
Now, the requirement is to look up the external Hive table based on three columns from the CSV file: ACCT_NO, SUBID, and MCODE. If they match, update accountlinks.account.closeddate with NewClosedDate from the CSV file.
I have already written the following code to explode the required columns and join them with the small table. However, I am not sure how to update the closeddate field (currently null for all account holders) with NewClosedDate, because closeddate is a nested column and I cannot simply use withColumn to populate it. In addition, the schema and column names cannot be changed, because these files are linked to an external Hive table.
val df = spark.sql("select * from db.table where archive='201711'")
val ExtractedColumn = df
  .coalesce(150)
  .withColumn("ACCT_NO", explode($"accountlinks.account.acctno"))
  .withColumn("SUBID", explode($"accountlinks.account.subid"))
  .withColumn("MCODE", explode($"accountlinks.account.mcode"))
val ReferenceData = spark.read.format("csv")
  .option("header", "true")
  .option("inferSchema", "true")
  .load("file.csv")
val FinalData = ExtractedColumn.join(ReferenceData, Seq("ACCT_NO", "SUBID", "MCODE"), "left")
All you need is to explode the accountlinks array and then join the two DataFrames like this:
val explodedDF = df.withColumn("account", explode($"accountlinks"))
val joinCondition = $"ACCT_NO" === $"account.acctno" && $"SUBID" === $"account.subid" && $"MCODE" === $"account.mcode"
val joinDF = explodedDF.join(ReferenceData, joinCondition, "left")
Now you can update the account struct column as shown below, and use collect_list to get back the array structure:
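val FinalData = joinDF.withColumn("account",
    struct($"account.acctno", $"account.subid", $"account.mcode",
      $"account.openeddate",
      // closeddate is a string in the Hive schema, so cast the timestamp
      $"NewClosedDate".cast("string").alias("closeddate")
    )
  )
  // groupBy() with no columns collapses everything into a single row;
  // pass the column(s) identifying each original record to keep one array per record
  .groupBy().agg(collect_list($"account").alias("accountlinks"))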
The idea is to create a new struct with all the fields from account except closeddate, which is taken from the NewClosedDate column.
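One thing to watch: groupBy() with no key columns aggregates the whole DataFrame into a single row. If you need one accountlinks array per original record, the following is a rough sketch of one way to do it, not a definitive implementation. It makes several assumptions. The row_id column and the updateClosedDate function are hypothetical names introduced here. Each array element is assumed to expose acctno, subid, mcode, openeddate and closeddate directly, as in the code above. The CSV is assumed to hold at most one row per ACCT_NO/SUBID/MCODE key and to be small enough to broadcast. The table is assumed to have no top-level columns clashing with the CSV column names. Note that explode drops rows whose array is null or empty, and collect_list does not guarantee the original element order.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions._

// Sketch: update closeddate inside accountlinks while keeping one output row per input row.
// `row_id` is a hypothetical helper column, not part of the original table.
def updateClosedDate(df: DataFrame, referenceData: DataFrame): DataFrame = {
  // Every top-level column other than the array is carried through the aggregation.
  val otherCols = df.columns.filter(_ != "accountlinks")

  // Tag each input row, then produce one row per array element.
  val exploded = df
    .withColumn("row_id", monotonically_increasing_id())
    .withColumn("account", explode(col("accountlinks")))
    .drop("accountlinks")

  val joinCondition =
    col("ACCT_NO") === col("account.acctno") &&
      col("SUBID") === col("account.subid") &&
      col("MCODE") === col("account.mcode")

  // Rebuild the struct; keep the existing closeddate when the CSV has no match.
  val updated = exploded
    .join(broadcast(referenceData), joinCondition, "left")
    .withColumn("account", struct(
      col("account.acctno"), col("account.subid"), col("account.mcode"),
      col("account.openeddate"),
      coalesce(col("NewClosedDate").cast("string"), col("account.closeddate"))
        .alias("closeddate")))

  // Re-assemble the array per original row and restore the other columns.
  val aggregates =
    collect_list(col("account")).alias("accountlinks") +:
      otherCols.map(c => first(col(c)).alias(c)).toSeq

  updated
    .groupBy("row_id")
    .agg(aggregates.head, aggregates.tail: _*)
    .select(df.columns.map(col): _*) // original column order, helper columns dropped
}
```

Calling updateClosedDate(df, ReferenceData) should then return a DataFrame with the same top-level columns as df. If the table already has a natural key, grouping by that key instead of a generated id is likely the simpler choice.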
If the struct contains many fields, you can use a for-comprehension to collect every field except closeddate, rather than typing them all out.
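A minimal sketch of that idea, assuming joinDF already holds the exploded account struct column and the joined NewClosedDate column (the helper name withNewClosedDate is hypothetical):

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, struct}
import org.apache.spark.sql.types.StructType

// Rebuild the `account` struct from its own schema: every field is kept as-is,
// except `closeddate`, which is replaced by NewClosedDate (cast to string to
// match the existing field type).
def withNewClosedDate(joinDF: DataFrame): DataFrame = {
  val accountType = joinDF.schema("account").dataType.asInstanceOf[StructType]

  // For-comprehension over the field names, preserving the original field order.
  val fields = for (name <- accountType.fieldNames) yield {
    if (name == "closeddate") col("NewClosedDate").cast("string").alias("closeddate")
    else col(s"account.$name")
  }

  joinDF.withColumn("account", struct(fields: _*))
}
```

Because the field list is read from the DataFrame schema at runtime, fields added to the struct later are picked up without changing the code.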